Implement helper functions for dictionary and data management #182

Open
marcboulle opened this issue May 2, 2024 · 2 comments
Labels
Priority/1-Medium To do after P0 Status/Draft The issue is still not well defined

Comments

@marcboulle

marcboulle commented May 2, 2024

Description

Following the work on using Khiops for the Kaggle challenge, it seems that a few helpers in pykhiops could ease dictionary manipulation and better address some data management needs.

Questions/Ideas

A quick prototype was implemented to illustrate how some of these needs could be addressed:

  • creation of a valid multi-table dictionary from files
    • evaluated by recreating the multi-table dictionaries of the SpliceJunction and Accidents databases
  • access to the structure of a schema (tables and relations) via methods of DictionaryDomain and Dictionary
  • efficient concatenation of files
  • sorting of a set of input files of the same kind to obtain a single usable output file
  • systematic addition of date-difference variables in a snowflake schema, based on a reference date
    in the root table

This is just a very quick POC, currently being evaluated by Romain and Nicolas.
It could potentially lead to adding a few methods to pykhiops core and a few additional examples.

@folmos-at-orange folmos-at-orange changed the title Helpers for dictionary and data management Implement helper functions for dictionary and data management May 3, 2024
@folmos-at-orange folmos-at-orange added Priority/1-Medium To do after P0 Status/Draft The issue is still not well defined labels May 3, 2024
@folmos-at-orange folmos-at-orange self-assigned this Sep 4, 2024
@folmos-at-orange folmos-at-orange added Priority/0-High To do now and removed Priority/1-Medium To do after P0 labels Sep 4, 2024
@folmos-at-orange
Member

folmos-at-orange commented Sep 4, 2024

Dictionary and Data Management Improvement Proposal

1) Make the state of a DictionaryDomain more accessible

Current Situation

While constructing a DictionaryDomain, we can use print at any time to display it in the .kdic format.
However, for large dictionaries, it is very difficult to grasp the structure and relations this way.

Proposal

  1. Implement the instance method DictionaryDomain.print_relations, which prints to the screen
    a hierarchical schema of the relations between the dictionaries in the domain.
  2. Implement the instance method DictionaryDomain.to_dot, which returns a representation of the
    dictionaries and their relations in the dot format, which can be consumed by
    GraphViz to visualize the schema.

Example

We have loaded the Accidents.kdic file into domain.

domain.print_relations()

Accidents(AccidentId)
|
+--1:1-- Place(AccidentId)
|
+--1:n-- Vehicle(AccidentId, VehicleId)
         |
         +--1:n-- User(AccidentId, VehicleId)


domain.to_dot()
# Returns a string of dot code from which an image like this one can be generated:

[Image: Acc]
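As a rough illustration of the two outputs proposed above, here is a minimal standalone sketch. It does not use pykhiops: the relations are given as hypothetical `(parent, child, cardinality)` triples, `relations_to_dot` and `format_relations` are made-up names, and the schema is assumed to be a tree (no cycles).

```python
# Hypothetical stand-in for the domain: (parent, child, cardinality) triples
RELATIONS = [
    ("Accidents", "Place", "1:1"),
    ("Accidents", "Vehicle", "1:n"),
    ("Vehicle", "User", "1:n"),
]


def relations_to_dot(relations, graph_name="schema"):
    """Render the relations as a Graphviz digraph in dot syntax."""
    lines = [f"digraph {graph_name} {{", "  node [shape=box];"]
    for parent, child, cardinality in relations:
        lines.append(f'  "{parent}" -> "{child}" [label="{cardinality}"];')
    lines.append("}")
    return "\n".join(lines)


def format_relations(relations, root, indent=0):
    """Return the hierarchy below `root` as indented text lines (tree assumed)."""
    lines = [root] if indent == 0 else []
    for parent, child, cardinality in relations:
        if parent == root:
            lines.append(" " * (indent + 2) + f"+--{cardinality}-- {child}")
            lines.extend(format_relations(relations, child, indent + 11))
    return lines


dot_code = relations_to_dot(RELATIONS)
tree_text = "\n".join(format_relations(RELATIONS, "Accidents"))
print(tree_text)
print(dot_code)
```

The dot string can be saved to a file and rendered with the Graphviz `dot` command line tool.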

2) Ease the checking of dictionaries

Current Situation

Checking that a DictionaryDomain has no errors requires either extra code to write a file or a call to
a core API function.

Proposal

Implement the instance method DictionaryDomain.check that reports any error on the domain.

Example

Before

# First way
domain.export_khiops_dictionary_file("tmp.kdic")
kh.read_dictionary_file("tmp.kdic")

# Second way
kh.check_database(domain, "bidon", "bidon.txt")

After

domain.check()  # Raises ValueError if there are problems; otherwise does nothing

3) Simplify the creation of dictionary/domain objects from data table files

Current Situation

Adding to a DictionaryDomain a Dictionary built with build_dictionary_from_data_table requires
non-essential code to create intermediary files.

Proposal

Add a factory class method Dictionary.from_data_table with the following
parameters:

  • The data table file path
  • An optional name parameter to specify the dictionary name. By default, the name of the dictionary
    is the data table file name without its extension.
  • All options of kh.build_dictionary_from_data_table.

Example

Before

domain = kh.DictionaryDomain()
kh.build_dictionary_from_data_table("Accidents.txt", "Accident", "Accidents.kdic")
tmp_domain = kh.read_dictionary_file("Accidents.kdic")
domain.add_dictionary(tmp_domain.get_dictionary("Accident"))

After

domain = kh.DictionaryDomain()
domain.add_dictionary(kh.Dictionary.from_data_table("Accidents.txt", name="Accident"))

4) Simplify the creation of relations between tables in a DictionaryDomain

Current Situation

Adding relations between tables requires ~5 lines of non-essential code to create intermediary
Variable objects. Also, easily detectable errors, such as a child Dictionary that does not exist, are
only reported when using a Khiops API call.

Proposal

Add the instance method DictionaryDomain.add_relation with the following parameters:

  • The name of the parent table
  • The name of the child table
  • An optional is_unary=False parameter to specify whether the relation is 1-1 instead of 1-n.
  • An optional name=None parameter to specify the name of the relation variable in the parent
    dictionary.
  • An optional key=None parameter to specify a relation to an external table. In this case key contains
    the column names of the join key and is_unary is ignored (only 1-1 relations are accepted).

The return value is the instance of Variable associated with the relation. The method would
perform minimal checks such as that both dictionaries exist in the domain.

Example

Suppose we have the Accidents tables loaded into a DictionaryDomain instance called domain, but
that there are no links between them. We want to add the relations Accident -> Place (1-1) and
Accident -> Vehicle (1-n).

Before

accidents_kdic = domain.get_dictionary("Accident")
table_var = kh.Variable()
table_var.name = "Vehicles"
table_var.type = "Table"
table_var.object_type = "Vehicle"
accidents_kdic.add_variable(table_var)
entity_var = kh.Variable()
entity_var.name = "Place"
entity_var.type = "Entity"
entity_var.object_type = "Place"
accidents_kdic.add_variable(entity_var)

After

domain.add_relation("Accident", "Vehicle", name="Vehicles")
domain.add_relation("Accident", "Place", is_unary=True)
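To make the intended behaviour of add_relation more concrete, here is a minimal sketch using stand-in classes. `SketchVariable` and `SketchDomain` are hypothetical and only mimic the fields used in the "Before" code. The default relation name (the child name) is an assumption, and the `key` parameter for external tables is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SketchVariable:
    # Stand-in for kh.Variable, reduced to the fields used here
    name: str
    type: str
    object_type: str = ""


@dataclass
class SketchDomain:
    # Stand-in for kh.DictionaryDomain: dictionary name -> list of variables
    dictionaries: dict = field(default_factory=dict)

    def add_relation(self, parent, child, is_unary=False, name=None):
        """Add a relation variable to `parent` after minimal checks."""
        for dictionary_name in (parent, child):
            if dictionary_name not in self.dictionaries:
                raise ValueError(f"Dictionary not found in domain: {dictionary_name}")
        variable = SketchVariable(
            # Assumption: the relation is named after the child by default
            name=name if name is not None else child,
            type="Entity" if is_unary else "Table",
            object_type=child,
        )
        self.dictionaries[parent].append(variable)
        return variable


domain = SketchDomain({"Accident": [], "Vehicle": [], "Place": []})
vehicles_var = domain.add_relation("Accident", "Vehicle", name="Vehicles")
place_var = domain.add_relation("Accident", "Place", is_unary=True)
```

With this shape, a typo in a table name fails immediately with a ValueError instead of surfacing later in a Khiops call.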

5) Allow creating simple multi-table domains from a simple dict spec

Current Situation

Creating multi-table dictionaries is a complex operation even with the improvements (3) and (4).

Proposal

Add a factory class method DictionaryDomain.from_spec which takes a dict dataset specification
(like those used in the sklearn sub-module). This will allow users to specify their multi-table
dataset once and obtain a dictionary domain with a single call. Moreover, it ensures that the keys
are set as Categorical.

Additionally, this may serve as an intermediary step for users moving from the scikit-learn
interface to the core interface.

Example

For simplicity, we use proposals (3) and (4) in the "Before" part of this example.

Before

# Add the dictionaries from the tables
domain = kh.DictionaryDomain()
domain.add_dictionary(kh.Dictionary.from_data_table("Accidents.txt"))
domain.add_dictionary(kh.Dictionary.from_data_table("Places.txt"))
domain.add_dictionary(kh.Dictionary.from_data_table("Vehicles.txt"))
domain.add_dictionary(kh.Dictionary.from_data_table("Users.txt"))

# Add the relations
domain.add_relation("Accidents", "Places", is_unary=True, name="Place")
domain.add_relation("Accidents", "Vehicles")
domain.add_relation("Vehicles", "Users")

# Set the dictionary keys
domain.get_dictionary("Accidents").key = ["AccidentId"]
domain.get_dictionary("Places").key = ["AccidentId"]
domain.get_dictionary("Vehicles").key = ["AccidentId", "VehicleId"]
domain.get_dictionary("Users").key = ["AccidentId", "VehicleId"]

# Set Accident Id to `Categorical`
domain.get_dictionary("Accidents").get_variable("AccidentId").type = "Categorical"
domain.get_dictionary("Places").get_variable("AccidentId").type = "Categorical"
domain.get_dictionary("Vehicles").get_variable("AccidentId").type = "Categorical"
domain.get_dictionary("Users").get_variable("AccidentId").type = "Categorical"

After

domain = kh.DictionaryDomain.from_spec({
  "main_table": "Accident",
  "tables": {
    "Accident": ("Accidents.txt", "AccidentId"),
    "Place": ("Places.txt", "AccidentId"),
    "Vehicle": ("Vehicles.txt", ["AccidentId", "VehicleId"]),
    "User": ("Users.txt", ["AccidentId", "VehicleId"]),
  },
  "relations": [
    ("Accident", "Place", False),
    ("Accident", "Vehicle", True),
    ("Vehicle", "User", True),
  ],
})

6) Allow creating the additional_data_tables parameter from a simple dict spec

Current Situation

The additional_data_tables parameter may be difficult to write manually for complex multi-table
datasets, as it requires assigning each table file to its data path.

Proposal

Add the helper function kh.create_additional_data_tables_param which takes a simple dict dataset
specification and produces the necessary additional_data_tables dict, ready to be used.

Example

Before

# ...
# After the ~20 lines of the "Before" part of example (5) that define the "domain" variable
kh.train_predictor(domain, "Accidents", "Accidents.txt", "Gravity", "./results.txt",
  additional_data_tables = {
    "Accidents`Places": "Places.txt",
    "Accidents`Vehicles": "Vehicles.txt",
    "Accidents`Vehicles`Users": "Users.txt"
  }
)
# NB: The user may use domain.get_data_paths("Accidents") to obtain the keys of the
# additional_data_tables dict.

After

spec = {
  "main_table": "Accident",
  "tables": {
    "Accident": ("Accidents.txt", "AccidentId"),
    "Place": ("Places.txt", "AccidentId"),
    "Vehicle": ("Vehicles.txt", ["AccidentId", "VehicleId"]),
    "User": ("Users.txt", ["AccidentId", "VehicleId"]),
  },
  "relations": [
    ("Accident", "Place", False),
    ("Accident", "Vehicle", True),
    ("Vehicle", "User", True),
  ],
}
domain = kh.DictionaryDomain.from_spec(spec)
kh.train_predictor(domain, "Accident", "Accidents.txt", "Gravity", "./results.txt",
  additional_data_tables = kh.create_additional_data_tables_param(spec)
)
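One possible way to derive the data paths from such a spec is sketched below as a plain function, independent of pykhiops. It assumes that `main_table` names one of the entries of `tables`, that each relation variable is named after its child table, and that data path components are joined with a backquote as in the "Before" example. None of this is settled by the proposal.

```python
def create_additional_data_tables_param(spec, separator="`"):
    """Map the data path of each secondary table to its file path."""
    # Index the children of each table from the (parent, child, ...) tuples
    children = {}
    for parent, child, *_ in spec["relations"]:
        children.setdefault(parent, []).append(child)

    # Walk the schema from the main table, extending the data path at each level
    data_tables = {}
    pending = [(spec["main_table"], spec["main_table"])]
    while pending:
        table, data_path = pending.pop()
        for child in children.get(table, []):
            child_path = f"{data_path}{separator}{child}"
            data_tables[child_path] = spec["tables"][child][0]
            pending.append((child, child_path))
    return data_tables


spec = {
    "main_table": "Accident",
    "tables": {
        "Accident": ("Accidents.txt", "AccidentId"),
        "Place": ("Places.txt", "AccidentId"),
        "Vehicle": ("Vehicles.txt", ["AccidentId", "VehicleId"]),
        "User": ("Users.txt", ["AccidentId", "VehicleId"]),
    },
    "relations": [
        ("Accident", "Place", False),
        ("Accident", "Vehicle", True),
        ("Vehicle", "User", True),
    ],
}
additional_data_tables = create_additional_data_tables_param(spec)
print(additional_data_tables)
```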

7) Allow overriding the guessed column types when building dictionaries from tables

Current Situation

The type detection in build_dictionary_from_data_table is heuristic and it may not return what the
user needs. For example:

  • Id fields detected as Numerical must be changed to Categorical
  • Short string fields detected as Categorical must be changed to Text if the user wants
    AutoML treatment for them.

Proposal

Add a forced_types parameter to build_dictionary_from_data_table which takes a dict containing
the forced types by variable.

Note: This kind of setting is common in mainstream data frame libraries such as
Python's Pandas or
R's tibble.

Example

In the following example, we use Khiops to build a dictionary from the Accident table. The
build_dictionary_from_data_table heuristic detects AccidentId, which is part of the table's key, as
Numerical. We want to fix that.

Before

kh.build_dictionary_from_data_table("Accidents.txt", "Accident", "Accidents.kdic")
domain = kh.read_dictionary_file("Accidents.kdic")
domain.get_dictionary("Accident").get_variable("AccidentId").type = "Categorical"
domain.export_khiops_dictionary_file("Accidents.kdic")

After

kh.build_dictionary_from_data_table("Accidents.txt", "Accident", "Accidents.kdic",
                                    forced_types={"AccidentId": "Categorical"})

8) Add a heuristic to set table keys as Categorical when building dictionaries from tables

Current Situation

See (7).

Proposal

Add an id_heuristic parameter to build_dictionary_from_data_table. If set to True then
a variable's type will be set as Categorical if its name matches these common table key
conventions:

  • At the beginning: id_, Id_, ID_, Id[A-Z]
  • At the end: _id, _Id, _ID, [a-z]Id

The heuristic would be turned off by default (id_heuristic=False) to preserve the current
behavior.

Example

Before

kh.build_dictionary_from_data_table("Users.txt", "User", "User.kdic")
domain = kh.read_dictionary_file("User.kdic")
domain.get_dictionary("User").get_variable("AccidentId").type = "Categorical"

After

kh.build_dictionary_from_data_table("Users.txt", "User", "User.kdic", id_heuristic=True)
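The naming conventions listed in the proposal translate fairly directly into a regular expression. The sketch below is one reading of them. `looks_like_id` and `apply_id_heuristic` are hypothetical names, and the input is a plain `{variable name: type}` dict rather than a Khiops dictionary.

```python
import re

# Prefixes: id_, Id_, ID_, Id[A-Z]; suffixes: _id, _Id, _ID, [a-z]Id
ID_NAME_PATTERN = re.compile(r"^(?:id_|Id_|ID_|Id[A-Z])|(?:_id|_Id|_ID|[a-z]Id)$")


def looks_like_id(variable_name):
    """Return True if the name matches one of the listed key conventions."""
    return ID_NAME_PATTERN.search(variable_name) is not None


def apply_id_heuristic(detected_types):
    """Return a copy of {name: type} where id-like names are Categorical."""
    return {
        name: "Categorical" if looks_like_id(name) else detected_type
        for name, detected_type in detected_types.items()
    }


detected = {"AccidentId": "Numerical", "VehicleId": "Categorical", "Validity": "Numerical"}
corrected = apply_id_heuristic(detected)
print(corrected)
```

Note that with these exact conventions a variable named just `Id` or `ID` is not matched, which may or may not be desired.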

9) Simplify the addition of simple variables

Current Situation

Adding simple variables to a Dictionary requires too much intermediary code (similar to (4)).

Proposal

Change the Dictionary.add_variable instance method so that:

  • It accepts str types in addition to Variable. The argument in that case is the name of the
    variable.
  • It has an optional parameter type. This parameter should be set to the type of the variable when
    its argument is its name.

Example

Our objective is to add 100 numerical variables to a Dictionary instance.

Before

kdic = kh.Dictionary()
for i in range(100):
  var = kh.Variable()
  var.name = f"Var{i}"
  var.type = "Numerical"
  kdic.add_variable(var)

After

kdic = kh.Dictionary()
for i in range(100):
  kdic.add_variable(f"Var{i}", type="Numerical")
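A minimal sketch of how the extended add_variable could dispatch on its argument type, using hypothetical stand-in classes. Rejecting `type` when a Variable object is passed is an assumption of this sketch and not part of the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class SketchVariable:
    # Stand-in for kh.Variable
    name: str = ""
    type: str = ""


@dataclass
class SketchDictionary:
    # Stand-in for kh.Dictionary
    variables: list = field(default_factory=list)

    def add_variable(self, variable, type=None):
        """Accept either a variable object or a name plus a type."""
        if isinstance(variable, str):
            if type is None:
                raise ValueError("'type' is required when 'variable' is a name")
            variable = SketchVariable(name=variable, type=type)
        elif type is not None:
            # Assumption: an explicit type conflicts with a full variable object
            raise ValueError("'type' must not be set when 'variable' is an object")
        self.variables.append(variable)
        return variable


kdic = SketchDictionary()
for i in range(100):
    kdic.add_variable(f"Var{i}", type="Numerical")
```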

10) Improve rule construction in dictionaries

Current Situation

To create a rule variable, the user must set the rule field of a Variable instance. This field
is simply a str, thus dictionary syntax and escaping errors will be only reported once the user
uses the DictionaryDomain in a core API function.

Proposal

Add a DictionaryRule class whose constructor signature is:

  • The rule name
  • Any number of operands of type
    • str
    • int or float
    • kh.Variable
    • kh.DictionaryRule
  • An optional var_name parameter

Additionally, the method Dictionary.add_variable will be overloaded to accept a
kh.DictionaryRule, in which case var_name is its variable name in the dictionary.

This class enables the following features:

  • Check that the rule exists
  • Check that a variable name used as an operand is well escaped
  • Automatically set the variable type when using Dictionary.add_variable

Example

This example is unrealistically difficult, but it illustrates the advantages of the proposed
DictionaryRule class: we'll try to manually reconstruct a Naive Bayes classifier from a single
interval partition for the Iris database (we suppose the Iris dictionary is loaded into
iris_kdic).

Before

vclass_var = kh.Variable()
vclass_var.name = "VClass"
vclass_var.type = "Structure"
vclass_var.structure_type = "DataGrid"
vclass_var.rule = """
  DataGrid(
    ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
    Frequencies(38, 32, 35)
  )
"""
iris_kdic.add_variable(vclass_var)

dg_var = kh.Variable()
dg_var.name = "PPetalLength"
dg_var.type = "Structure"
dg_var.structure_type = "DataGrid"
dg_var.rule = """
  DataGrid(
    IntervalBounds(3.15, 4.75, 5.15),
    ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
    Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26)
  )
"""
iris_kdic.add_variable(dg_var)

nb_var = kh.Variable()
nb_var.name = "NBClass"
nb_var.type = "Structure"
nb_var.structure_type = "Classifier"
nb_var.rule = "NBClassifier(DataGridStats(PPetalLength, PetalLength), VClass)"
iris_kdic.add_variable(nb_var)

After

vclass_rule = kh.DictionaryRule(
  "DataGrid",
  kh.DictionaryRule("ValueSetC", "Iris-setosa", "Iris-versicolor", "Iris-virginica"),
  kh.DictionaryRule("Frequencies", 38, 32, 35),
  var_name="VClass"
)
iris_kdic.add_variable(vclass_rule)

dg_rule = kh.DictionaryRule("DataGrid",
  kh.DictionaryRule("IntervalBounds", 3.15, 4.75, 5.15),
  kh.DictionaryRule("ValueSetC", "Iris-setosa", "Iris-versicolor", "Iris-virginica"),
  kh.DictionaryRule("Frequencies", 38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26),
  var_name="PPetalLength"
)
iris_kdic.add_variable(dg_rule)

nb_rule = kh.DictionaryRule("NBClassifier",
  kh.DictionaryRule(
    "DataGridStats",
    iris_kdic.get_variable("PPetalLength"),
    iris_kdic.get_variable("PetalLength"),
  ),
  iris_kdic.get_variable("VClass"),
  var_name="NBClass"
)
iris_kdic.add_variable(nb_rule)
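The rendering side of such a class could look like the sketch below. `SketchRule` and `VariableRef` are hypothetical stand-ins for the proposed kh.DictionaryRule and for kh.Variable operands. The escaping conventions used here (double-quoted strings with inner quotes doubled, backquoted variable names when they are not plain identifiers) are this sketch's assumptions about the dictionary syntax. Checking that the rule name exists is not covered.

```python
import re


class VariableRef:
    """Stand-in for a kh.Variable operand, identified by its name only."""

    def __init__(self, name):
        self.name = name


class SketchRule:
    """Stand-in for the proposed kh.DictionaryRule, limited to rendering."""

    def __init__(self, name, *operands, var_name=None):
        self.name = name
        self.operands = operands
        self.var_name = var_name

    def __str__(self):
        rendered = ", ".join(render_operand(operand) for operand in self.operands)
        return f"{self.name}({rendered})"


def render_operand(operand):
    """Render one operand according to its Python type."""
    if isinstance(operand, str):
        # Assumption: string constants are double-quoted, inner quotes doubled
        return '"' + operand.replace('"', '""') + '"'
    if isinstance(operand, VariableRef):
        if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", operand.name):
            return operand.name
        # Assumption: other names go between backquotes, inner backquotes doubled
        return "`" + operand.name.replace("`", "``") + "`"
    if isinstance(operand, (int, float, SketchRule)):
        return str(operand)
    raise TypeError(f"Unsupported operand type: {type(operand).__name__}")


dg_rule = SketchRule(
    "DataGrid",
    SketchRule("IntervalBounds", 3.15, 4.75, 5.15),
    SketchRule("ValueSetC", "Iris-setosa", "Iris-versicolor", "Iris-virginica"),
    SketchRule("Frequencies", 38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26),
    var_name="PPetalLength",
)
stats_rule = SketchRule("DataGridStats", VariableRef("PPetalLength"), VariableRef("Petal Length"))
print(dg_rule)
print(stats_rule)
```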

11) Simplify access to Dictionary/Variable objects in DictionaryDomain/Dictionary

Current Situation

With DictionaryDomain and Dictionary we often use the accessors get_dictionary and
get_variable. Since these names are very long and usually used in sequence, accessing
a variable's properties results in long lines.

Proposal

Add dict semantics to these classes. This would allow using the [] operator to access the
dictionaries or variables they contain.

Example

We have loaded the tables of the Accidents dataset into domain. We want to set the AccidentId
variable as Categorical in each table.

Before

domain.get_dictionary("Accidents").get_variable("AccidentId").type = "Categorical"
domain.get_dictionary("Places").get_variable("AccidentId").type = "Categorical"
domain.get_dictionary("Vehicles").get_variable("AccidentId").type = "Categorical"
domain.get_dictionary("Users").get_variable("AccidentId").type = "Categorical"

After

domain["Accidents"]["AccidentId"].type = "Categorical"
domain["Places"]["AccidentId"].type = "Categorical"
domain["Vehicles"]["AccidentId"].type = "Categorical"
domain["Users"]["AccidentId"].type = "Categorical"
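In Python, dict-like read access comes down to implementing a few special methods. The sketch below shows this on hypothetical stand-in classes. Which parts of the mapping protocol (iteration, `in`, `len`) the real classes should support is left open by the proposal.

```python
from types import SimpleNamespace


class NamedContainer:
    """Read-only mapping behaviour shared by both stand-in classes."""

    def __init__(self, items):
        self._items = dict(items)

    def __getitem__(self, name):
        return self._items[name]

    def __contains__(self, name):
        return name in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


class SketchDictionary(NamedContainer):
    # Stand-in for kh.Dictionary: the existing accessor delegates to []
    def get_variable(self, name):
        return self[name]


class SketchDomain(NamedContainer):
    # Stand-in for kh.DictionaryDomain
    def get_dictionary(self, name):
        return self[name]


domain = SketchDomain(
    {
        table: SketchDictionary({"AccidentId": SimpleNamespace(type="Numerical")})
        for table in ("Accidents", "Places", "Vehicles", "Users")
    }
)
# Iterating over the domain yields the dictionary names
for table in domain:
    domain[table]["AccidentId"].type = "Categorical"
```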

12) Simplify the creation of selection variables

Current Situation

Creating selection variables for common situations requires advanced knowledge of dictionary rules.

Proposal

i) Create the instance method Dictionary.add_selection_variable that takes the parameters

  • The name of the created variable
  • An optional min parameter to specify the minimum value for a Numerical selection
  • An optional max parameter to specify the maximum value for a Numerical selection
  • An optional values to specify a list of values for a Categorical selection

The method returns a reference to the created variable.

ii) Implement the and, or and not operators in Variable. This way, the user can easily mix
conditions.

Example

We want to filter the Accidents database by two criteria:

  • One selecting the accidents outside latitudes 46 to 50
  • Another selecting only the weather conditions FogOrSmoke or HeavyRain

Before

latitude_sel_var = kh.Variable()
latitude_sel_var.name = "LatitudeSelection"
latitude_sel_var.type = "Numerical"
latitude_sel_var.rule = "And(GE(Latitude, 46), LE(Latitude, 50))"
accidents_kdic.add_variable(latitude_sel_var)

weather_sel_var = kh.Variable()
weather_sel_var.name = "WeatherSelection"
weather_sel_var.type = "Numerical"
weather_sel_var.rule = 'InGroup(ValueGroup("FogOrSmoke", "HeavyRain"), Weather)'
accidents_kdic.add_variable(weather_sel_var)

sel_var = kh.Variable()
sel_var.name = "Selection"
sel_var.type = "Numerical"
sel_var.rule = "And(Not(LatitudeSelection), WeatherSelection)"
accidents_kdic.add_variable(sel_var)

After

latitude_sel_var = accidents_kdic.add_selection_variable("LatitudeSelection", min=46, max=50)
weather_sel_var = accidents_kdic.add_selection_variable(
  "WeatherSelection", values=("FogOrSmoke", "HeavyRain")
)
accidents_kdic.add_variable(not latitude_sel_var and weather_sel_var)
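One caveat for part (ii): Python does not allow overloading the `and`, `or` and `not` keywords, so an expression such as `not a and b` is evaluated with ordinary truthiness and yields a plain value rather than a new rule. The usual workaround is to overload the `&`, `|` and `~` operators through `__and__`, `__or__` and `__invert__`. A minimal sketch with a hypothetical `SelectionExpr` wrapper around rule strings:

```python
class SelectionExpr:
    """Hypothetical wrapper around a dictionary rule (or variable name)."""

    def __init__(self, rule):
        self.rule = rule

    def __and__(self, other):
        return SelectionExpr(f"And({self.rule}, {other.rule})")

    def __or__(self, other):
        return SelectionExpr(f"Or({self.rule}, {other.rule})")

    def __invert__(self):
        return SelectionExpr(f"Not({self.rule})")


latitude_sel = SelectionExpr("LatitudeSelection")
weather_sel = SelectionExpr("WeatherSelection")

# ~ binds tighter than &, so this reads as (not latitude) and weather
selection = ~latitude_sel & weather_sel
print(selection.rule)
```

Since `&` and `|` bind tighter than comparison operators, mixed expressions generally need explicit parentheses.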

13) Simplify sort_data_table for split data files

Current Situation

It is common to find datasets where one data table is split into many files. Since Khiops accepts
only single files for each table, the user needs to concatenate them before sorting. This operation
requires intermediary files.

Proposal

Make sort_data_table accept, in addition to single data table files:

  • list of files
  • glob expressions

Example

We have the main table Accidents split into four files:
Accidents_01.txt, ..., Accidents_04.txt. We want to prepare these files to be used with Khiops, which
means concatenating and then sorting them.

Before

import shutil

table_file_paths = ["Accidents_01.txt", "Accidents_02.txt", "Accidents_03.txt", "Accidents_04.txt"]
with open("UnsortedAccidents.txt", "wb") as output_file:
    for table_file_path in table_file_paths:
        with open(table_file_path, "rb") as table_file:
            shutil.copyfileobj(table_file, output_file)
kh.sort_data_table("Accidents.kdic", "Accident", "UnsortedAccidents.txt", "Accidents.txt")

After

# With a list of files
table_file_paths = ["Accidents_01.txt", "Accidents_02.txt", "Accidents_03.txt", "Accidents_04.txt"]
kh.sort_data_table("Accidents.kdic", "Accident", table_file_paths, "Accidents.txt")

# With a glob expression
kh.sort_data_table("Accidents.kdic", "Accident", "Accidents_*.txt", "Accidents.txt")
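The input handling could be factored into two small helpers, sketched here with the standard library only. `resolve_table_files` and `concatenate_files` are hypothetical names. The sketch copies bytes as-is, so it assumes that the parts carry no repeated header line and that each part ends with a newline.

```python
import glob
import os
import shutil
import tempfile


def resolve_table_files(table_files):
    """Turn a glob expression or a list of paths into a list of paths."""
    if isinstance(table_files, str):
        paths = sorted(glob.glob(table_files))
    else:
        paths = list(table_files)
    if not paths:
        raise FileNotFoundError(f"No data table file found for: {table_files}")
    return paths


def concatenate_files(input_paths, output_path):
    """Append the raw contents of each input file to the output file."""
    with open(output_path, "wb") as output_file:
        for input_path in input_paths:
            with open(input_path, "rb") as input_file:
                shutil.copyfileobj(input_file, output_file)


# Small demonstration on throwaway files
with tempfile.TemporaryDirectory() as tmp_dir:
    for index, content in enumerate([b"a\n", b"b\n"], start=1):
        with open(os.path.join(tmp_dir, f"Accidents_{index:02d}.txt"), "wb") as part:
            part.write(content)
    parts = resolve_table_files(os.path.join(tmp_dir, "Accidents_*.txt"))
    merged_path = os.path.join(tmp_dir, "UnsortedAccidents.txt")
    concatenate_files(parts, merged_path)
    with open(merged_path, "rb") as merged_file:
        merged = merged_file.read()
```

Sorting the glob matches keeps the concatenation order deterministic, although the order does not matter if the result is sorted afterwards.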

@folmos-at-orange folmos-at-orange added the Type/Feature A new feature request or an improvement of a feature label Sep 4, 2024
@folmos-at-orange
Member

After the 18/09 meeting, the following proposals are to be re-specified in detail:

  • High Priority: 3-4-9-10
  • Medium Priority: 1-11-12
  • Low Priority: 2-7-8-13

We'll soon have a special meeting for 5-6 to discuss the status of the dict spec.

@folmos-at-orange folmos-at-orange removed their assignment Sep 23, 2024
@popescu-v popescu-v added Priority/1-Medium To do after P0 and removed Priority/0-High To do now labels Jan 9, 2025
@folmos-at-orange folmos-at-orange removed the Type/Feature A new feature request or an improvement of a feature label Jan 16, 2025