
Write a Dart Script That Inputs VRI Tipitaka XML Files and Outputs SQL Files For Inserting Into SQLite DB #219

Open · iulspop opened this issue Oct 30, 2023 · 11 comments

iulspop commented Oct 30, 2023

Original request from Bhante Subhuti:

Yes.. let me show you one.. I think we have this book.. but it does not matter..
This is the XML format.
https://github.com/VipassanaTech/tipitaka-xml/blob/main/romn/e0201n.nrf.xml

We need to make an SQL text import file that writes to the pages, categories, books, and tocs tables.

You can look at the working version from the file that is installed from the App Store..
To find the db.. you just go to settings/helpaboutetc/reset data
DO NOT RESET.
The db directory will be shown there.

There is also an immediate need to add a "simple" field to the current books that are extensions. You can start with that.. the IIT chanting book.

The simple field will have the diacritical characters removed. (leave ñ alone)
I can give you Dart code that makes the simple field.

There are 2 ways it can be done.. One is just modifying our sql file.. The other option is to modify the program that made the files. The 2nd method is preferred.

This book has fake pages.. made up.. they cannot be too big .. nor too small or else we get performance issues.

For the xml, you will need to match the codes.. we want to keep codes that tell us the page, alt readings, and other book pages and paragraphs.

Start with this.
You can use the sqlite db browser to see if the imports work.
You can delete based on bookid: delete from pages where bookid=xyz

See the attached SQL; it will show you the format we want.
The job for this SQL and several others.. is to change the code that generated it.. add an extra field on the toc inserts. I'll send you that code later.

Bhante Subhuti

Here's what I've understood the task to be:

Write a Dart script which processes each XML file of the Roman script version of the Tipitaka provided by VRI and outputs an SQL file for importing the book into the SQLite DB.

Each XML file maps to a book in the books table, each book is related to one category in the category table, each book is related to many pages in the pages table, and each book is related to one toc in the tocs table.

"We need to make an sql txt import file that writes to the pages categories books and tocs tables"

You do not mention the paragraphs table here. Should that be written to as well?

> There is also an immediate need to add a "simple" field to the current books that are extensions. You can start with that.. the IIT chanting book.
> The simple field will have the diacritical characters removed. (leave ñ alone)
> I can give you Dart code that makes the simple field.

To which table would I add the "simple" field? The pages table? It already has the "content" text field.
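
In the meantime, here's a minimal sketch of what I assume the diacritic stripping could look like in Dart (hypothetical, lowercase-only; not the actual code Bhante mentioned):

```dart
// Hypothetical sketch of the "simple" transform: map Pali diacritics to
// plain ASCII while leaving ñ alone, as requested. Handles lowercase only.
String toSimple(String text) {
  const map = {
    'ā': 'a', 'ī': 'i', 'ū': 'u',
    'ṅ': 'n', 'ṇ': 'n', 'ṃ': 'm',
    'ṭ': 't', 'ḍ': 'd', 'ḷ': 'l',
    // 'ñ' is deliberately absent so it passes through unchanged.
  };
  final out = StringBuffer();
  for (final rune in text.runes) {
    final ch = String.fromCharCode(rune);
    out.write(map[ch] ?? ch);
  }
  return out.toString();
}

void main() {
  print(toSimple('paññā dīpanī')); // -> pañña dipani
}
```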

> For the xml, you will need to match the codes.. we want to keep codes that tell us the page, alt readings, and other book pages and paragraphs.

What specific XML codes are you referring to? In the XML files all I see are paragraph numbers like:

<p rend="bodytext" n="8"><hi rend="paranum">8</hi><hi rend="dot">.</hi> (Ka) dassanena pahātabbā dhammā.</p>

I don't see information about pages, alt readings, and other book pages and paragraphs.

> There are 2 ways it can be done.. One is just modifying our sql file.. The other option is to modify the program that made the files. The 2nd method is preferred.
> The job for this SQL and several others.. is to change the code that generated it.. add an extra field on the toc inserts. I'll send you that code later.

Where is the program that made the SQL files for importing the books like the one you shared?

The SQL file for importing the chanting book you shared: iit_chantingbook.sql.txt

Log of What I've Investigated So Far (these are mostly notes to myself)

I downloaded the app "Tipitaka Pali Reader" from the App Store on my desktop MacOS, and found the SQLite .db file at /Users/iulspop/Library/Containers/org.americanmonk.tpp/Data/Documents/tipitaka_pali.db.

I also learned I can clone this repo and run the following to download the unsplit tipitaka_pali.db file:

cd tipitaka-pali-reader/assets/database
gdown 1II8XYSQw0JzZxJk2J4QT9XyN2SnqT9qm
unzip tipitaka_pali.zip


I downloaded DB Browser for SQLite to explore the schema in a GUI.

I then looked at the structure of the VRI .xml files.


It looks like for each of the seven Abhidhamma Piṭaka books there's a .att.xml file for the "aṭṭhakathā" or commentary, a .tik.xml file for the "mūlaṭīkā" or sub-commentary, and a .mul.xml file for the book itself.

I don't understand what the .nrf.xml files are. Some are anuṭīkā texts, which I think means "sub-sub-commentary"? Others are not from the "Abhidhammapiṭake" nikaya but from other nikayas like "Abhidhammāvatāra-purāṇaṭīkā", or don't have a nikaya attribute at all but only a book title like "Abhidhammatthasaṅgaho". I suppose they're additional texts not part of the Pali Canon?

I found this "Essence of the Tipitaka" document by VRI a good reference for understanding which texts these various .xml files refer to: https://www.tipitaka.org/eot

I'm starting to see a structure.

There's an abh series of XML files which contains the Abhidhamma Piṭaka, its commentaries and sub-commentaries, and additional related texts.

There's an e series of files that seem to be extra Pali books outside the Tipiṭaka.

There's an s series of files that are part of the Sutta Piṭaka and its commentaries and sub-commentaries.

Then there's a vin series of files that are part of the Vinaya Piṭaka.

The XML files have these elements (I haven't compiled a comprehensive list yet):
head
div
p (paragraph)
pb (page break)
teiHeader
text
hi (highlighted)
note

p elements often have a rend attribute, like:
centre
nikaya
title
book
subsubhead
gatha1
gathalast
subhead
bodytext
indent
gatha2
gatha3
chapter
unindented
hangnum
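
As a sketch of how the Dart script might walk these elements, here's a skeleton using package:xml (an assumed dependency; the routing of rend values to toc entries vs. page content is my guess, to be confirmed):

```dart
// Sketch only: walk <p> elements and route them by rend value.
// Assumes package:xml (add `xml` to pubspec.yaml).
import 'dart:io';
import 'package:xml/xml.dart';

void main(List<String> args) {
  final doc = XmlDocument.parse(File(args.first).readAsStringSync());

  for (final p in doc.findAllElements('p')) {
    final rend = p.getAttribute('rend') ?? 'bodytext';
    final paranum = p.getAttribute('n'); // e.g. n="8" in the sample above
    final text = p.innerText.trim();

    switch (rend) {
      case 'nikaya':
      case 'book':
      case 'chapter':
      case 'subhead':
      case 'subsubhead':
        print('TOC  [$rend] $text'); // candidate tocs rows
        break;
      default:
        print('PARA ${paranum ?? "-"}: $text'); // candidate pages content
        break;
    }
  }
}
```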

iulspop changed the title from "Write a Dart Script That Converts Tipitaka XML format to SQL DB Format" to "Write a Dart Script That Inputs VRI Tipitaka XML Files and Outputs SQL Files For Inserting Into SQLite DB" on Oct 30, 2023
bksubhuti (Owner) commented

You have made some great progress all on your own.
There is no need for paragraphs table.
There is a more immediate need to make extensions for the books we are missing. I'm not sure if I asked this of you or another Lao monk. Putting it here is a good idea.
We are missing some books that are found in VRI and also tipitaka.app.
It is a priority, and a good way to practice, to get these books working as extensions.
They are independent from linked books, and if you get the pages wrong or off by one.. it does not matter so much.

The simple field in toc is no longer used. I will remove that.
The initial query is small enough that we can get all toc items for a book and then filter locally.
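
Something like this is what I mean by filtering locally (a sketch assuming the app uses sqflite; the tocs table/column names here are hypothetical):

```dart
// Query all toc rows for one book, then filter in Dart rather than
// issuing further SQL queries. Table/column names are hypothetical.
import 'package:sqflite/sqflite.dart';

Future<List<Map<String, Object?>>> tocsFor(
    Database db, String bookId, String filter) async {
  final rows =
      await db.query('tocs', where: 'book_id = ?', whereArgs: [bookId]);
  return rows
      .where((r) => ((r['name'] ?? '') as String).contains(filter))
      .toList();
}
```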

It would be good to do a call on Google Meet.


iulspop commented Oct 30, 2023

Missing añña books:
all Saṃgāyana pucchā
all Ledi Sayadaw
all buddhavandana
all vansagatha
from grammar Bālāvatāra
all nithigantha
all pakinaka gantha
all sinhala gantha

Prioritize a Saṃgāyana pucchā book or a missing Ledi Sayadaw book

I'll start with "Patanudessa"

Getting it into our system is the priority; focus on Myanmar paragraph numbers and real pages.


iulspop commented Oct 30, 2023

Reorganize tpr_downloads to have a release dir where we put .zip files with the SQL files for importing the añña texts (later the whole VRI Tipitaka).


iulspop commented Oct 30, 2023

TODOs

Iuliu:

  • Look at the Patanudessa book, write down the input and expected output, then verify with Bhante Subhuti that the expected output is correct

Bhante Subhuti:

  • Get the codes for ALT readings, page and paragraph numbers in various editions

bksubhuti (Owner) commented

I sent a request to Janaka for the VRI codes, and sent a request to a monk to give the name of a priority book.
Added you as a collaborator on tpr_downloads.

bksubhuti (Owner) commented

Anudīpanīpāṭha was suggested.
https://tipitaka.org/romn/cscd/e0401n.nrf0.xml

and if you want to do Saṃgāyana pucchā, you can also do that.
First book of first folder is here.
https://tipitaka.org/romn/cscd/e0901n.nrf0.xml

You can choose which one.. probably the Ledi Sayadaw book will be easier.

bksubhuti (Owner) commented

The message I got back was this..

"I don't think the codes are documented anywhere. At least I haven't seen. also I don't think the codes are too complicated to understand studying oneself which is what I did. when you go through looking at the XML file you will intuitively understand what the codes mean. of course if he has any questions I would be happy to answer as well. The problem with making documentation is that I will have to go through a file and try to understand them again since I have forgotten all of it. So it is best to ask questions when you have and I will be happy to answer."

He is on Facebook under the name Path Nirvana, so you can send him a message if you need help.
I think you might be able to leave them.. "as is".. we can see later.
The code investigation would be better with a mūla book.. for instance the Majjhima Nikāya. There will be different versions for page numbers and different alt readings.
You can find that by ctrl-clicking on tipitaka.org (going to the website that displays the Pali) and then matching it with the GitHub link.

I think this is the link here

bksubhuti (Owner) commented

By comparing MN 1, I think the alt readings have a note tag.

And the books should be aligned in the beginning.. so we should know.. paranum.. and 3 books.
I'll ask what the letters are but it should not matter.. and we probably have the same code pasted verbatim.
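
If that holds, extracting them could look something like this (speculative; the pb element's ed/n attribute names are guesses to verify against the actual VRI files):

```dart
// Speculative sketch: pull alt readings and edition page breaks, assuming
// alt readings live in <note> elements and page breaks in <pb> elements
// with `ed` (edition) and `n` (page) attributes.
import 'dart:io';
import 'package:xml/xml.dart';

void main(List<String> args) {
  final doc = XmlDocument.parse(File(args.first).readAsStringSync());

  for (final note in doc.findAllElements('note')) {
    print('alt reading: ${note.innerText}');
  }
  for (final pb in doc.findAllElements('pb')) {
    print('page break: ed=${pb.getAttribute('ed')} n=${pb.getAttribute('n')}');
  }
}
```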


iulspop commented Nov 3, 2023

Hi @bksubhuti, I read the TPR Downloads repo more carefully, and I understand now that all of the SQL files there are for importing extensions, not the main Tipitaka texts. I wonder how the current Tipitaka texts available in Tipitaka Pali Reader were imported. Are there SQL files, or code used to generate the SQL files for them, available anywhere? Or were they imported manually somehow? Or maybe you reused the already-loaded database from the Myanmar-only app?

In any case, I dumped the current DB data to SQL to get started. I followed up on this issue in this PR draft: bksubhuti/tpr_downloads#2. Let's continue the conversation there.

bksubhuti (Owner) commented

I will forward a message to @pndaza. He is more familiar with the format. Hopefully he can comment and answer your questions. It is critical to have the page breaks match his page breaks, especially for the main texts. The main texts are a better learning exercise than the añña books, which don't have links in them.

The original design, made several years ago, has the three top-level categories hard-coded. It has caused some issues with searching, and we would like to fix this. I thought we had an issue for this, but I cannot seem to find it. If you go to book_list_page.dart, you will find the correct codes for the topmost-level categories. I'm going to breakfast now.. but I think you are exceeding my knowledge of the texts now. Great job. I'll try to send Ven. @pndaza to your PR and also merge this. You have direct access as well to push.

bksubhuti (Owner) commented

Note to self and update on progress:
All books can be imported with SQL.
Need to break up the sections and make zipped extensions.
Ledi Sayadaw section is set to be finished next weekend (July 14).
The SQL script should take care of multiple installs without duplicates: delete previous instances ("if exists") of the category, book, pages, and tocs rows (or conditionally add the category).
The SQL script should also remove previous Ledi Sayadaw books from the añña section that are not grouped under this new heading (if they exist): book, pages, and tocs.
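
A sketch of how the generated import script could stay idempotent, emitted from Dart (table/column names and the category row are illustrative, not the real schema):

```dart
// Emit an import script that deletes any previous instance of a book
// before re-inserting, and adds the category only if it is missing.
String importSql({required String bookId, required String bookInserts}) {
  final sql = StringBuffer()
    ..writeln('BEGIN TRANSACTION;')
    // Remove earlier copies so repeated installs do not duplicate rows.
    ..writeln("DELETE FROM pages WHERE bookid = '$bookId';")
    ..writeln("DELETE FROM tocs WHERE book_id = '$bookId';")
    ..writeln("DELETE FROM books WHERE id = '$bookId';")
    // INSERT OR IGNORE adds the category only when it does not exist yet
    // (assumes a unique constraint on the category name).
    ..writeln("INSERT OR IGNORE INTO category (name) VALUES ('Ledi Sayadaw');")
    ..writeln(bookInserts) // the generated INSERT statements for this book
    ..writeln('COMMIT;');
  return sql.toString();
}
```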
