clean import data #34

Closed
wants to merge 18 commits into from
87 changes: 87 additions & 0 deletions apis_ontology/management/commands/create_base_data.py
@@ -0,0 +1,87 @@
import os
from django.core.management.base import BaseCommand
from apis_core.apis_relations.models import Property
from apis_ontology.models import Archive, Person, WorkType
from apis_ontology.scripts.additional_infos import ARCHIVES, PERSONS, WORK_TYPES
from apis_ontology.scripts.import_helpers import create_triple, create_source

fname = os.path.basename(__file__)


def create_archives(calling_file=fname):
    """
    Create objects for Archive entity.

    :param calling_file: optional argument to pass filename of the calling
                         script, otherwise uses this file's name
    """
    import_name = "Archives_Import"

    for a in ARCHIVES:
        # for archives, save XML file as pubinfo when creating sources
        # for later reference
        source, created = create_source(import_name, metadata=a["source_file"])
        Archive.objects.get_or_create(
            name=a["name"],
            defaults={"data_source": source},
        )


def create_persons(calling_file=fname):
    """
    Create objects for Person entity.

    :param calling_file: optional argument to pass filename of the calling
                         script, otherwise uses this file's name
    """
    import_name = "Persons_Import"
    source, created = create_source(import_name, metadata=calling_file)

    for p in sorted(PERSONS, key=lambda d: d["id"]):
        Person.objects.get_or_create(
            name=p["name"],
            first_name=p["first_name"],
            last_name=p["last_name"],
            defaults={"data_source": source},
        )


def create_types(calling_file=fname):
    """
    Create objects for WorkType entity.

    :param calling_file: optional argument to pass filename of the calling
                         script, otherwise uses this file's name
    """
    import_name = "WorkTypes_Import"
    source, created = create_source(import_name, metadata=calling_file)

    # types with parents, not top-level types
    children = {key: val for (key, val) in WORK_TYPES.items() if val["parent_key"]}

    # create objects for all types
    for work_type in WORK_TYPES.values():
        wtype, created = WorkType.objects.get_or_create(
            name=work_type["german_label"],
            name_plural=work_type["german_label_plural"],
            defaults={"data_source": source},
        )

    for work_type in children.values():
        wt_object = WorkType.objects.get(name=work_type["german_label"])
        parent_key = work_type["parent_key"]
        parent_object = WorkType.objects.get(
            name=WORK_TYPES[parent_key]["german_label"]
        )
        create_triple(
            entity_subj=wt_object,
            entity_obj=parent_object,
            prop=Property.objects.get(name="has broader term"),
        )


class Command(BaseCommand):
    def handle(self, *args, **options):
        create_archives()
        create_persons()
        create_types()
189 changes: 189 additions & 0 deletions apis_ontology/management/commands/delete_source_objects.py
@@ -0,0 +1,189 @@
from django.core.management.base import BaseCommand
from apis_core.apis_metainfo.models import Source
from apis_core.utils.caching import (
    get_all_entity_class_names,
    get_entity_class_of_name,
)


class Command(BaseCommand):
    # TODO allow removal of source itself when/once empty

    help = "Delete entity objects from specific Sources."

    all_sources = Source.objects.all()
    source_names = [s.orig_filename for s in all_sources]
    # allow targeting of objects which don't belong to a Source; useful
    # e.g. when objects were previously imported without assigning a Source
    # or when a Source was deleted (by name) but its objects remained
    # TODO rework 'NULL' sources so they remain available as an option
    #  but aren't included when deleting 'ALL_SOURCES'
    source_names.append("NULL")

    missing_args_message = (
        "\n"
        "You need to provide both the name of a source "
        "and an entity to delete from it!"
    )

    def add_arguments(self, parser):
        # optional arguments
        group = parser.add_mutually_exclusive_group()
        group.add_argument(
            "-l",
            "--list",
            dest="list",
            action="store_const",
            const=True,
            help="List all available sources.",
        )
        # positional arguments (required)
        parser.add_argument(
            "source",
            nargs=1,
            type=str,
            help="Name of source for which to remove entity objects.",
        )
        parser.add_argument(
            "entity",
            nargs="+",
            type=str,
            help="Name of model class from which to remove those objects.",
        )
        # optional arguments
        parser.add_argument(
            "-n",
            "--dry-run",
            dest="skip",
            action="store_const",
            const=True,
            help="Dry-run deletion action. "
            "Does not actually delete objects from the database.",
        )

    def handle(self, *args, **options):
        src_id = "-1"
        msg_prefix = ""
        skip = None
        source_obj = None
        entities = []
        entities_failed = []

        all_sources = Source.objects.all()
        all_entities = get_all_entity_class_names()
        entity_names = list(all_entities)

        if options["skip"]:
            skip = True
            msg_prefix = "DRY RUN – "

        source_name = options["source"][0]
        if source_name == "ALL_SOURCES":
            source_obj = Source.objects.all()
        elif source_name != "NULL":
            source_obj = Source.objects.filter(orig_filename=source_name)
            if len(source_obj) == 0:
                for src in self.source_names:
                    self.stdout.write(src)
                self.stdout.write(
                    self.style.ERROR(
                        f"The supplied Source {source_name} does not exist, "
                        f"please choose one from the above list."
                    )
                )
                # exit() is only injected by the site module; raise directly
                raise SystemExit(1)
            else:
                if len(source_obj) > 1:
                    self.stdout.write(
                        "There are several source objects with the given name. "
                        "Please provide the ID for the source from which to delete objects:"
                    )
                    src_ids = list(source_obj.values_list("id", flat=True))
                    src_ids_str = [str(x) for x in src_ids]
                    src_ids_str.append("ALL")

                    for src in source_obj:
                        self.stdout.write(
                            f"{src.id}, {src.orig_filename}, {src.pubinfo}"
                        )

                    while src_id not in src_ids_str:
                        src_id = input()

                    if src_id != "ALL":
                        # keep a queryset (not a single object) so the
                        # later source__in lookup can iterate over it
                        source_obj = Source.objects.filter(id=src_id)
        else:
            source_name = "(NULL)"
            source_obj = Source.objects.filter(orig_filename=source_name)

        entities_provided = options["entity"]
        for ent_prov in entities_provided:
            ent = ent_prov.split(",")
            entities.extend(ent)

        if entities[0] == "ALL_ENTITIES":
            # shortcut to allow deletion of all entity objects
            # with a given source
            entities = entity_names

        for ent in entities:
            if ent not in entity_names:
                entities_failed.append(ent)
            else:
                ent_class = get_entity_class_of_name(ent)

                # if len(source_obj) > 1:
                #     print(source_obj[0].orig_filename)
                #     exit()
                # else:
                try:
                    ent_obj = ent_class.objects.filter(source__in=source_obj)
                except Exception:
                    # source_obj may be a queryset without orig_filename,
                    # so report the source by the name given on the CLI
                    self.stdout.write(f"No objects left for source {source_name}.")
                    raise SystemExit(0)

                if not source_obj:
                    ent_obj = ent_class.objects.filter(source__isnull=True)

                obj_count = len(ent_obj)

                success_msg = (
                    f"Deleted {obj_count} {ent} objects from Source {source_name}"
                )
                nothing_todo_msg = (
                    f"No {ent} objects to delete from Source {source_name}."
                )

                if skip:
                    success_msg = msg_prefix + success_msg
                    nothing_todo_msg = msg_prefix + nothing_todo_msg

                if obj_count > 0:
                    self.stdout.write(self.style.SUCCESS(success_msg))
                    for obj in ent_obj:
                        delete_msg = f".. Deleted {ent} object {obj}."
                        delete_err_msg = f"Failed to delete '{obj}'."
                        if skip:
                            delete_msg = msg_prefix + delete_msg
                            delete_err_msg = msg_prefix + delete_err_msg

                        try:
                            if not skip:
                                obj.delete()
                            self.stdout.write(delete_msg)
                        except Exception as e:
                            self.stdout.write(self.style.ERROR(delete_err_msg))
                            # stdout.write expects a string, not an exception
                            self.stdout.write(self.style.ERROR(str(e)))

                else:
                    self.stdout.write(nothing_todo_msg)

        if len(entities_failed) > 0:
            self.stdout.write(self.style.ERROR("The following entities do not exist:"))

            for ent in entities_failed:
                self.stdout.write(self.style.ERROR(ent))

            self.stdout.write("Available entities:")
            self.stdout.write(", ".join(entity_names))
13 changes: 13 additions & 0 deletions apis_ontology/management/commands/run_ontology_script.py
@@ -0,0 +1,13 @@
from django.core.management.base import BaseCommand
import importlib


class Command(BaseCommand):
    def handle(self, *args, **options):
        script = importlib.import_module(
            f"apis_ontology.scripts.{options['ontology_script']}"
        )
        script.run()

    def add_arguments(self, parser):
        parser.add_argument("ontology_script")
45 changes: 45 additions & 0 deletions apis_ontology/scripts/README.md
@@ -0,0 +1,45 @@
# Import scripts for Frischmuth data

The import process is split across three scripts, _which need to be run in order_:

1. [import_xml](import_xml.py)
2. [import_zotero_collections](import_zotero_collections.py)
3. [import_nonbibl_entities](import_nonbibl_entities.py)

## Prerequisites

Two of the scripts require access to a separate private repository, **vorlass_data_frischmuth**, which contains XML data describing objects in Barbara Frischmuth's Vorlass. This repository is assumed to sit at the root of the Git superproject. If it is included as a submodule, make sure the superproject points to its latest commit.

One script requires access to a private [Zotero](https://www.zotero.org/) library. It expects the environment variables `ZOTERO_API_KEY` and `ZOTERO_LIBRARY_ID` to be set. If it can't find them, it prompts you to enter the values manually.
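
The environment-variable lookup with a manual fallback could look roughly like the sketch below. The helper name `get_zotero_setting` and its `prompt` parameter are assumptions made for illustration; the actual import script may handle this differently.

```python
import os


def get_zotero_setting(name, prompt=input):
    """Return the value of environment variable `name`, or ask for it.

    `get_zotero_setting` is a hypothetical helper, not part of this repository.
    """
    value = os.environ.get(name, "").strip()
    while not value:
        # Fall back to a manual prompt when the variable is unset or empty.
        value = prompt(f"Please enter a value for {name}: ").strip()
    return value
```

Calling `get_zotero_setting("ZOTERO_API_KEY")` returns the exported value if there is one and only asks interactively otherwise; the `prompt` parameter exists so the fallback can be replaced in tests.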

Before running the scripts, make sure all properties between entities have been created. That is, the `create_relationships` management command must have been run at least once beforehand.

When developing locally, also take care to have environment variables for your local database set, or use the `--settings` parameter to point to a local settings file with variables for your DB.

## Import all Vorlass data

This script needs access to files contained within the `vorlass_data_frischmuth` repository.

Run the import with:
```sh
$ python manage.py run_ontology_script import_xml
```
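
The `run_ontology_script` command imports the module named on the command line from `apis_ontology.scripts` and calls its `run()` function, so every import script needs to expose a `run()` entry point. The sketch below mimics that lookup with a stand-in package; `run_script`, `demo_scripts` and `hello` are made-up names so the example runs without the real project.

```python
import importlib
import sys
import types


def run_script(name, package="apis_ontology.scripts"):
    """Import `<package>.<name>` and call its run() function."""
    module = importlib.import_module(f"{package}.{name}")
    return module.run()


# Stand-in package and module (hypothetical names) registered by hand,
# so the lookup can be demonstrated outside the Django project.
demo_package = types.ModuleType("demo_scripts")
demo_package.__path__ = []  # an (empty) __path__ marks it as a package
demo_module = types.ModuleType("demo_scripts.hello")
demo_module.run = lambda: "hello from run()"
sys.modules["demo_scripts"] = demo_package
sys.modules["demo_scripts.hello"] = demo_module

result = run_script("hello", package="demo_scripts")
```

A script without a module-level `run()` would fail with an `AttributeError` at the `module.run()` call.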

## Import data from Zotero

This script needs access to a project-specific Zotero library.

Run the import with:
```sh
$ python manage.py run_ontology_script import_zotero_collections
```

## Import non-bibliographic entities

This script needs access to files contained within the `vorlass_data_frischmuth` repository.

Run the import with:
```sh
$ python manage.py run_ontology_script import_nonbibl_entities
```
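
Because the three scripts have to run in order, the calls can be wrapped in a small helper. This is only a sketch: `IMPORT_ORDER` and `run_all` are hypothetical names, and how a single script is started is left to the caller.

```python
# Script names taken from the list at the top of this README.
IMPORT_ORDER = (
    "import_xml",
    "import_zotero_collections",
    "import_nonbibl_entities",
)


def run_all(run_one, scripts=IMPORT_ORDER):
    """Run each script in order; an exception stops the remaining ones."""
    completed = []
    for name in scripts:
        run_one(name)  # raises on failure, so later scripts are skipped
        completed.append(name)
    return completed
```

Inside the project, `run_one` could for example be `lambda name: call_command("run_ontology_script", name)`, using `call_command` from `django.core.management`.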
