-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial database schema #53
base: v5-develop
Are you sure you want to change the base?
Conversation
docs/database/0002_account.sql
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Serial cleans up the setup for 1) auto-increment, 2) NOT NULL and 3) Unique Primary Keys.
For the Foreign Keys (references), the default behavior is that you will not be able to delete a row if its being referenced in another table. There is nothing wrong with this approach, the only thing I would say is if there are going to be frequent deletes to tables with foreign keys, then the deletes have to be done in a specific order, or you can change the behavior of the foreign key to "CASCADE" on delete/update. This will make sure the other references are updated/deleted as well. There is a potential performance impact here.
docs/database/0003_clip.sql
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the account_id column (referencing the ID in account)...I think you have to have the same "ON DELETE CASCADE" behavior in other tables that are also referencing ID in the account table. I will circle back on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for start_time/end_time-is there a reason why you want to use numeric? Is to store epoch time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the account_id column (referencing the ID in account)...I think you have to have the same "ON DELETE CASCADE" behavior in other tables that are also referencing ID in the account table. I will circle back on this
Ah, I am missing all of the ON DELETE CASCADE rules in the account sql file. I will review the full code and try to determine which are missing, and which are unnecessary.
for start_time/end_time-is there a reason why you want to use numeric? Is to store epoch time?
No, that is a good catch. We only need up to 2 decimal points for start/end time.
Since we only need 2 decimal points, I guess we will use the FLOAT type...however, I suppose we could make it a numeric and limit the numeric to 2 decimal points? While the latter seems more efficient, I suspect numeric values maybe quirkier than floats (I think I saw that both 0 and 0.0 are possible separate numeric values), so I'll go ahead with FLOAT unless we can think of good reasons to go with numeric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you not just use the datatype "time" there? its format is 00:00:00.
https://www.postgresql.org/docs/current/datatype-datetime.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I hadn't thought of that. I guess we could use it since it has microsecond resolution? The only reason I can think of possibly not however, is that when parsing "time position" values from RSS feeds, or using them in a media player (ex. start playback from 90 seconds in), the values will be integers or decimals, not hh:mm:ss:ms...so I'm guessing we would need to convert those values to and from hh:mm:ss:ms time?
If there isn't an advantage, like a performance benefit, for using datatype "time", then I think I'd prefer to use "float".
docs/database/0004_playlist.sql
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know code reuse is something we all strive for, which 'INHERITS' allows for, but this can cause some performance overhead because you will be querying more than one table. https://www.postgresql.org/docs/current/tutorial-inheritance.html. I generally don't use this in my tables, only because I don't mind writing out the same columns over (and over) again. Sure, it makes for more reading, but it isn't going to cause any potential performance hit that 'INHERITS' does. But, as the documentation says, its great for object-oriented databases. I would suggest keeping this in place for now, but if you feel like you want to start squeezing performance out of the database-I would start here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am coming back from speaking with my colleague and I think it would be wise to not use inherit. Here is why;
By using inherit, you are essentially created a link between two tables for the sake of cleaning up code (not having to write out the same code to create the same type of columns in several tables). The postgres documentation actually gives a good example using 'Cities' table and 'Capitals' table. The 'Capitals' table inherits the 'Cities' columns plus addition columns added specific to 'Capitals'. The query hitting the Cities table will then, essentially be querying two tables (not just Cities). If you are to look for cities that are not capitals, you then have to use specific syntax of 'FROM ONLY CITIES'. Seems kind of silly when you could just essentially have a cities table with a boolean column labeled 'capital' with a default of false.
So-TLDR-I would avoid using inherit and just write out the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very glad you and your colleagues looked into this. The INHERITS
usage is totally new to me, and I'm not sure if it would help or hurt more long run...
Part of why I used INHERITS
is not only for code usability, but because we have concepts like a "playlist", and a playlist can have multiple different table types associated with it. For example, a playlist consists of playlist_items, and each playlist_item can have a foreign key relationship with 4 different data types: item, clip, chapter, soundbite.
I was thinking that if we use inherits, then when it is time to query the database for ALL of the playlist_items for a account's/user's playlist, then we could query the playlist_item_base
table using the playlist_id
, and use UNION
so each playlist_item
returned would select the different columns that correspond with the different tables that inherit playlist_item_base
.
Does that make sense, or am I making incorrect assumptions? Is there another / better way to do this besides this INHERITS
plus UNION
approach?
(P.S. I know the table names may semantically sound like they can be consolidated into one table, like why can't "clip" and "soundbite" be the same table? But these are distinct entities according to RSS spec, so I'd suggest ignoring the apparent similarities for now.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I mentioned this in the matrix channel, but I think the idea would be less on the table setup and more in the query to get the data you need. Because you can get the same values using joins, and if there are indexes on those tables, the performance should be better than using INHERITS
.
For example
SELECT a.column, b.column
FROM table_1 a
INNER JOIN table_2 b on a.id = b.id
WHERE a.column <> 42
But this would mean your sql queries will need to do the work with the joins, and not using the backend "magic" of INHERITS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, if the performance should be better than using inherits, we're definitely trying to choose performance wherever possible. If not using inherits allows that, I'd like go with that option, but I'd like to explain the use case one more time to be sure, before I stop using inherits.
Our playlists will have "playlist items" in them like this, of 5 potentially different types:
- clip
- item (episode)
- chapter
- soundbite
- item (episode)
- chapter
- clip
- add_by_rss_item (episode)
- etc
We'll need to make paginated queries, for example for the first 20 of those playlist items for a specific playlist, ordered by the position
column.
Also, when updating a position
of a playlist item (ex. the playlist item at position 7 moves to position 3), we will need to get the playlist items with the next highest and lowest position values, so that the playlist item in the new position can be given a position value in between those 2 values.
If there is a playlist_item_base
table, then I was thinking it might be more performant to query that table for the items ordered by position, rather than querying all 5 of the playlist item tables, and ordering the items by position, and then somehow stitching the requested range (ex. the next 20 playlist items by position from a playlist) of items together and returning them.
Do you think that querying the 5 tables separately could be more performant than a base table for this purpose? (Part of my problem may be that I don't yet know what a query that can stitch and return this paginated data together would look like.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Id say stick with the inherit since you have already been investing in the design (in time and effort). What we could do down the line would be to put time into writing the query to get what you need and compare the performance. Ultimately, you want to do something that not just you will be able to understand, and inherit gives people that breadcrumb trail. Keep on keeping on.
|
||
CREATE OR REPLACE FUNCTION check_account_id_uniqueness() | ||
RETURNS TRIGGER AS $$ | ||
BEGIN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good use case for a UNION instead of a JOIN here. Nice work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you :)
That said...we are actually discussing possibly merging all 3 of these tables...
history
now_playing
queue
And making them one table, where the position would be negative for history items, 0 for now playing item, and positive for upcoming queue items.
If that's the case, then I think we may not need a union like this.
docs/database/0006_queue_content.sql
Outdated
CREATE TABLE queue_content_base ( | ||
id SERIAL PRIMARY KEY, | ||
account_id INTEGER NOT NULL REFERENCES account(id) ON DELETE CASCADE, | ||
position numeric_20_11 NOT NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again-for the numeric data type. Are you anticipating the numbers to be large for position of "queued content?" What would this table hold specifically? Queued uploads to rss feeds? Just trying to clarify things for myself !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, and I got the idea for this blog post. @suorcd and I have discussed this quite a bit, and aren't 100% set on if we want to go this way or not.
The reasoning is basically 1) using integer for position can lead to problems, as reordering items in the list means every item after that item also needs its position updated. 2) using a decimal column allows you to reorder by fitting a reordered item between 2 positions, by calculating the middle point between the two.
An alternative to this could be a jsonb column that maintains the position order data on a higher order table (like playlist), which may be doable for our use case, but we'll lose some referential benefits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, I think you are good to use the numeric datatype. Looking at this again, it makes more sense from a positioning perspective.
('mixed'), ('podcastL'), ('musicL'), ('videoL'), ('filmL'), ('audiobookL'), ('newsletterL'), ('blogL'), ('publisherL'), ('courseL') | ||
; | ||
|
||
CREATE TABLE medium_base ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH there we go-found the table :) Ignore my comment about missing the 'Medium' table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to still review the 0001_init_podcasting_20_database.sql file. I will do this tomorrow if I get a chance.
docs/database/0000_init_helpers.sql
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After speaking with my colleagues, what you are doing here makes sense-you are enforcing constraints/checks on commonly used columns that will be amongst your tables. You are essentially creating 'global check constraints.' So, in my eyes, its a great way to keep data integrity. You could also do the same thing per table, setting constraints on the columns by each table (instead of doing the domain). However, its just another way to 'skin a cat'. This just makes it easier across the board.
Some examples my colleague brought up was enforcing format of birthdays or email addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking, and yes exactly. A DOMAIN like varchar_url I think is convenient because we can easily ensure each url field is limited to 2083 characters and checks it begins with http:// or https://
No description provided.