DataBase Structure (4.0.0a13) #145

spirillen · 2020-12-22T11:01:15Z

spirillen
Dec 22, 2020

Is your feature request related to a problem? Please describe.
While looking into the DB, I find it hard to figure out how to extract the INVALID records from the expired or otherwise failed records in pyfunceble_inactive.

Describe the solution you'd like
As referenced in https://www.mypdns.org/T1250, make a table of static values, each referring to a result, such as:

| id | Human_description |
|----|-------------------|
| 1 | ACTIVE and VALID |
| 2 | WHOIS INACTIVE |
| 3 | INVALID SYNTAX |
| 4 | SYNTAX |
| 5 | AVAILABILITY |

and so on. Then append these values to the pyfunceble_inactive and pyfunceble_continue tables, in addition to the checker type.
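A minimal sketch of what such a lookup table could look like, using SQLite from Python. The `status_ref` table, the `status_id` column and the trimmed-down `pyfunceble_inactive` layout are assumptions made for illustration; they are not PyFunceble's actual schema.

```python
import sqlite3
from typing import List

# Hypothetical schema: none of these definitions come from PyFunceble itself.
SCHEMA = """
CREATE TABLE status_ref (
    id INTEGER PRIMARY KEY,
    human_description TEXT NOT NULL UNIQUE
);
CREATE TABLE pyfunceble_inactive (
    id INTEGER PRIMARY KEY,
    idna_subject TEXT NOT NULL,
    checker_type TEXT NOT NULL,
    status_id INTEGER NOT NULL REFERENCES status_ref (id)
);
"""

# The static values from the table above.
STATUSES = [
    (1, "ACTIVE and VALID"),
    (2, "WHOIS INACTIVE"),
    (3, "INVALID SYNTAX"),
    (4, "SYNTAX"),
    (5, "AVAILABILITY"),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the lookup table filled in."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO status_ref VALUES (?, ?)", STATUSES)
    return conn


def subjects_with_status(conn: sqlite3.Connection, description: str) -> List[str]:
    """Return the subjects whose status matches the given description."""
    rows = conn.execute(
        """
        SELECT i.idna_subject
        FROM pyfunceble_inactive AS i
        JOIN status_ref AS s ON s.id = i.status_id
        WHERE s.human_description = ?
        ORDER BY i.idna_subject
        """,
        (description,),
    )
    return [row[0] for row in rows]
```

With such a reference in place, a script could call `subjects_with_status(conn, "INVALID SYNTAX")` to pull only the invalid records, while each row stores a small integer instead of a repeated string.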

Case example.
We have been testing a damned huge source of 1.8 million records from two or more clients. Each client has been running with its own kind of test, one with an availability test and one with an ordinary syntax test (just examples, as the source is tested for syntax and converted to IDNA in the DB).

As a next step, a user would like to extract any INVALID records from pyfunceble_inactive in order to delete them from their own json|xml|csv sources, and they would like to do this with a script. The question is: how do you choose the right records from pyfunceble_inactive ❓

Describe alternatives you've considered
None for now

Additional context
We could save a lot of space by reusing such a table to hold the values of checker_type.

Answered by funilrys

Dec 22, 2020

spirillen · 2020-12-22T11:10:28Z

spirillen
Dec 22, 2020
Author

My second question about the new structure: why have you chosen to delete pyfunceble_file and move the duplicated megabytes of a source into the pyfunceble_(in)active tables?

That I don't understand. As I see it at first glance, it will increase the sheer size of the DB by hundreds of unnecessary megabytes.

funilrys · 2020-12-22T13:00:57Z

funilrys
Dec 22, 2020
Maintainer

The pyfunceble_file table was, in my mind, useless. The pyfunceble_continue and pyfunceble_inactive tables are sufficient for PyFunceble (itself). In fact, I'm convinced that the JOIN I systematically had to do to get the right dataset was slowing down PyFunceble (among other things in the past design).

Keep in mind that the data stored in there is not meant for usage outside of PyFunceble itself. In the past, because of a bad design, I chose to store all that data so that I could generate the files at the very end (once I finished the tests of everything). That means that in the past you could, theoretically, sneak into the database, read it, and do your thing as long as it worked. With our new thread model for the CLI tests, the file writing is done almost instantly, without any trick or hack like I used to do in the past.

In the past, because writing files from parallel processes was unsafe, I chose not to write immediately into the output files. That's why most of us know the Merging process... message, which was mostly displayed after the tests.

Right now, when a testing task is done, it submits the result (itself) to the producer queue, which then (in parallel) generates the files in a sequential and safe manner. So the reason I had such tables in the past was that I was doing everything regarding the file generation under the multiprocessing mode, at the very end. Meaning that I had to store the data somewhere until the testing session was complete. With our new model, it's done in the background by a dedicated thread that is just waiting for new data.

Just look at it:

PyFunceble/PyFunceble/cli/threads/producer.py

Lines 351 to 403 in f557b0d

    
    def target(self) -> None:
        """
        This is our core logic. Everything starts here!
        """

        stop_message_caught = False

        while True:
            if self.the_queue.empty():
                continue

            consumed = self.the_queue.get()

            PyFunceble.facility.Logger.debug("Got: %r", consumed)

            if consumed == "stop":
                PyFunceble.facility.Logger.info(
                    "Got stop message. Stopping reading from the queue."
                )

                stop_message_caught = True
                break

            if not isinstance(consumed, tuple):
                continue

            PyFunceble.facility.Logger.debug("Got: %r", consumed)

            test_dataset, test_result = consumed

            self.block_printer = False

            if isinstance(test_result, str) and test_result.startswith("ignored_"):
                self.run_ignored_file_printer(test_dataset, test_result)
            else:
                self.send_for_mining(test_dataset, test_result)
                self.run_autosave(test_dataset, test_result)
                self.run_inactive(test_dataset, test_result)

                PyFunceble.facility.Logger.debug(
                    "Printer Blocked: %r", self.block_printer
                )

                ## WARNING: DO NOT RUN PRINTER BEFORE the `run_inactive` or
                ## `run_autosave` methods.
                self.run_stdout_printer(test_result)
                self.run_file_printer(test_dataset, test_result)
                self.run_counter(test_dataset, test_result)
                self.run_whois_backup(test_result)

        if stop_message_caught:
            self.add_to_output_queue("stop")
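The same tester-to-producer hand-off can be reduced to a self-contained sketch with the standard library only. This is not PyFunceble's code: the `produce` callback, the `run_demo` helper and the blocking `get()` (instead of polling `empty()`) are illustrative choices.

```python
import queue
import threading
from typing import Callable, List, Tuple

STOP = "stop"  # Same sentinel value as in the excerpt above.


def producer_loop(
    the_queue: "queue.Queue[object]",
    produce: Callable[[str, str], None],
) -> None:
    """Consume (subject, status) tuples until the stop message arrives."""
    while True:
        consumed = the_queue.get()  # Blocks until a tester submits something.

        if consumed == STOP:
            break

        if not isinstance(consumed, tuple):
            continue  # Ignore anything that is not a result tuple.

        subject, status = consumed
        produce(subject, status)


def run_demo() -> List[Tuple[str, str]]:
    """Feed a few fake results through a dedicated producer thread."""
    results: List[Tuple[str, str]] = []
    the_queue: "queue.Queue[object]" = queue.Queue()

    producer = threading.Thread(
        target=producer_loop,
        args=(the_queue, lambda subject, status: results.append((subject, status))),
        daemon=True,
    )
    producer.start()

    # The "tester" side: submit each result as soon as its test is done.
    for item in [("example.org", "ACTIVE"), ("example.invalid", "INVALID")]:
        the_queue.put(item)
    the_queue.put(STOP)

    producer.join()
    return results
```

Because a single thread drains the queue, results are handled one at a time and in submission order, which is what makes writing the output files safe without a merge step at the end.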

We have to think about your proposition of changing the status to a number. But you'll have to convince me, because I'm not ready to make the source code overcomplicated over this. Right now, the source code related to the dataset matches and uses the dataset that is generated/produced by each checker. I unified every checker status so that I could manipulate the data without having to compute anything.

Meaning that when the tester threads send their results to the producer threads, the data is directly transferred to one of the dataset handlers, which then works with it directly without any huge conversion. In other words, the dataset handlers understand the information and work with it directly, without any substantial non-human conversion. But you may be right about the storage size. I'm open for discussion though.

About the `pyfunceble_continue` table.

The purpose of this table (as of today) is just to store the list of subjects of a given file that were already tested. After a file test is done, all data related to the file we are testing is completely deleted. Meaning that, yes, it will store a copy of all the entries of the file you test. But once done, all of that data is removed.

When we compare, we check the following fields against the dataset we are going to test:

PyFunceble/PyFunceble/dataset/autocontinue/base.py

Lines 71 to 76 in f557b0d

    
    COMPARISON_FIELDS: List[str] = [
        "idna_subject",
        "checker_type",
        "destination",
        "source",
    ]

If all of those fields match, the subject will be ignored for testing.
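As a rough illustration of that rule, the sketch below treats rows as plain dictionaries and checks every comparison field. The `already_tested` helper and the dictionary representation are assumptions made for this example; the real handlers may well query the database instead.

```python
from typing import Dict, Iterable, List

COMPARISON_FIELDS: List[str] = [
    "idna_subject",
    "checker_type",
    "destination",
    "source",
]


def already_tested(
    candidate: Dict[str, str], stored_rows: Iterable[Dict[str, str]]
) -> bool:
    """Return True if a stored row matches the candidate on every field."""
    return any(
        all(row.get(field) == candidate.get(field) for field in COMPARISON_FIELDS)
        for row in stored_rows
    )
```

A subject that differs in even one field, for example the same `idna_subject` read from another `source`, does not match and would be tested again.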

About the `pyfunceble_inactive` table

The purpose of this table (as of today) is just to store all non-ACTIVE subjects of a given file so that we can retest them later. We don't save any status, because we don't need to know or store any status for what we need to do.

When we compare, we check the following fields against the dataset we are going to test:

PyFunceble/PyFunceble/dataset/inactive/base.py

Lines 71 to 76 in f557b0d

    
    COMPARISON_FIELDS: List[str] = [
        "idna_subject",
        "checker_type",
        "destination",
        "source",
    ]

If all of those fields match, the subject will be ignored for testing unless the conditions for testing are met (inactive db deactivated, or DBR reached).

About the use case example:

> As a next step, a user would like to extract any INVALID records from pyfunceble_inactive in order to delete them from their own json|xml|csv sources, and they would like to do this with a script.

If I understand correctly, you want to get the INVALID records so that you can remove them from the given source file? If yes, then just merge the ACTIVE and INACTIVE files (output directory) and you are done.

I know @mitchellkrogza used to do such a thing in the past, but right now, there is no reason anymore because the files are generated as we test.
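A small sketch of that merge, under stated assumptions: the ACTIVE and INACTIVE outputs are plain lists with one subject per line, `#` lines are comments, and subjects appear in the same form as in the source file. The helper names are made up for this example.

```python
from typing import Iterable, List, Set


def read_subjects(lines: Iterable[str]) -> Set[str]:
    """Collect subjects, skipping blank lines and '#' comment lines."""
    subjects = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            subjects.add(line)
    return subjects


def drop_invalid(
    source: Iterable[str], active: Iterable[str], inactive: Iterable[str]
) -> List[str]:
    """Keep only the source entries that appear in ACTIVE or INACTIVE."""
    valid = read_subjects(active) | read_subjects(inactive)
    return [entry for entry in (line.strip() for line in source) if entry in valid]
```

Anything in the source that shows up in neither list is treated as INVALID and dropped, while the original order of the remaining entries is preserved.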

> The question is, how do you choose the right records from the pyfunceble_inactive ?

I just compare fields 😄 I now understand the pain: You want to access the database directly to do something else for your workflow.

> That I don't understand. As I see it at first glance, it will increase the sheer size of the DB by hundreds of unnecessary megabytes.

Mhh, that was more the case in the past. I was literally storing every single test status result in the database, whereas now I just generate the corresponding file. But we need to test with a huge dataset and see.

Conclusion

Why am I not doing one table which stores the subjects and another one which stores the results?

I really thought of it for a long time. We had past discussions about it. And yes in practice it's possible, I even wrote it. Then, later I changed my mind because it became clear that we don't need to store that much information (complete status for example) in the database.

The current solution (luckily we are still in alpha, so we can still discuss it) is the best for what we do right now from the CLI. I agree that if you test millions of records, the size of your database will grow. But that's the price, in my opinion, for source code that stays maintainable and doesn't slow down.

**What about the pyfunceble_status table?**

I totally deleted the pyfunceble_status table because it was storing a lot of data that nobody actually needs: in the end, it was only about generating a status file. As said, in the past I had to store all the datasets so that I could generate the results at the very end. Right now that is not the case, because I separated the tester from the producer. They are both daemon threads that run indefinitely, waiting for the dataset to test on one side and the result to generate on the other, until they both get a stop signal.

Let's discuss everything 😉 Just have to get AFK a few hours first.

spirillen · 2024-09-23T13:09:57Z

spirillen
Sep 23, 2024
Author

> We have to think about your proposition of changing the status to a number. But you'll have to convince me, because I'm not ready to make the source code overcomplicated over this.

This is about DB optimization: reusing a value is 1. faster and 2. uses less space = less I/O.
It has nothing to do with the program's operation itself at all. (How did I just happen to remember this, here 4 years later??)

> is just to store all non ACTIVE subjects of a given file

Again, a huge DB optimization possibility: only hold the domain/URI. The record and its status don't care how or where the record came from. As it is, the next time you test the source, maybe even the same file from a different path (URI), the same data is stored over and over again, leading to tons of wasted data.

This is about pyfunceble.pyfunceble_continue.destination, pyfunceble.pyfunceble_continue.source & pyfunceble.pyfunceble_inactive.destination, as these fields duplicate data (records). If you feel like keeping this data, make a new table that links to pyfunceble.pyfunceble_continue.idna_subject & pyfunceble.pyfunceble_inactive.idna_subject.

Skip the source reference, or at least reverse it so that the source links to the records; in my universe, I would skip it entirely. Also, as you mention:

> After a file test is done, all data related to the file we are testing is completely deleted

In short: when you start a new test, link the source to the records for this test, and drop those contents after the test has stopped (including completed).

This will also be an improvement, as you reuse the record data from previous tests of the same record: the previous tests are no longer tied to a specific source, and your match for skipping becomes:

    COMPARISON_FIELDS: List[str] = [
        "idna_subject",
        "checker_type",
    ]

Way faster and smaller DB table = faster search and rebuild on changes.
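One way to picture the proposal is the SQLite sketch below: each record is stored once per (`idna_subject`, `checker_type`) pair, and a separate link table ties a source to its records for the duration of a test. The `record` and `source_link` tables and the helper functions are hypothetical; nothing here reflects PyFunceble's existing schema.

```python
import sqlite3

# Hypothetical layout: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    idna_subject TEXT NOT NULL,
    checker_type TEXT NOT NULL,
    UNIQUE (idna_subject, checker_type)
);
CREATE TABLE source_link (
    source TEXT NOT NULL,
    record_id INTEGER NOT NULL REFERENCES record (id),
    PRIMARY KEY (source, record_id)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register(
    conn: sqlite3.Connection, source: str, idna_subject: str, checker_type: str
) -> int:
    """Store the record once, link it to the source, return the record id."""
    conn.execute(
        "INSERT OR IGNORE INTO record (idna_subject, checker_type) VALUES (?, ?)",
        (idna_subject, checker_type),
    )
    (record_id,) = conn.execute(
        "SELECT id FROM record WHERE idna_subject = ? AND checker_type = ?",
        (idna_subject, checker_type),
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO source_link (source, record_id) VALUES (?, ?)",
        (source, record_id),
    )
    return record_id


def finish_test(conn: sqlite3.Connection, source: str) -> None:
    """Drop the per-test links once the test has stopped or completed."""
    conn.execute("DELETE FROM source_link WHERE source = ?", (source,))
```

Registering the same subject from two different paths yields a single `record` row with two links, and `finish_test` removes only the links for the finished source.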

> I just compare fields 😄 I now understand the pain: You want to access the database directly to do something else for your workflow.

Oh yeah, as mentioned a couple of times over the years, a DB is a source of stored results (data) to be:

- Extracted by any program with access, for records of any status (ACTIVE, INACTIVE, INVALID).
  1. This way you can extract only the ACTIVE records to share or read into DNS, while keeping the source intact. (Just one of many use cases; the ACTIVE file does this too, but the DB is often a better choice for sharing data.)
- Stored to set up multiple test environments that test the same source at once (adding more worker machines to test the same source on a private platform).
- Shared to build a PyFunceble WHOIS DB platform, to spare the origin WHOIS DBs and avoid being blocked from accessing them (opt-in).
- Shared as tested records, aka the --platform-lookup & --push-platform, or even to exchange the full test records with the cloud of users as an opt-in. This would help minimize the number of re-tests of the same records all over the world.
  - Case: machine one at @mitchellkrogza tested add-wildcard-domain (phishing) this morning, then I initiate a test of my phishing blacklist https://github.com/mypdns/matrix/blob/master/phishing/wildcard.list (phishing). By sharing the results and offloading the --platform-* central DB, my client would choose to rely on Mitchell's test results, as I've received his fresh test results.

Any of these methods for data sharing should always be opt-in, as PyFunceble is easily used to test private zones.

> Mhh that was more the case in the past

Yep, this issue is also 4 years old 😏

> I totally deleted the pyfunceble_status

Good, it was only filler ❤️ and the status should be stored with its record (which, for performance, is a reference to a third table holding the statuses).
