Document and test with column collation #601
Currently, we are using the default database collation for the ancestry column. In most Rails apps, this is some form of Unicode locale. This prevents LIKE queries from using indexes. On Ubuntu systems, it also ignores slashes when sorting, which causes all sorts of problems. This changes the tests to use a simple binary comparison: ASCII, byte for byte. This gets better results, uses indexes for LIKE, and is faster for all comparisons. ENV["ANCESTRY_COLLATION"] should never be needed unless you are testing performance characteristics.
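To illustrate why ignoring slashes is a problem, here is a rough pure-Ruby sketch. It compares byte-wise ordering with a deliberately simplified stand-in for a punctuation-ignoring collation (it just drops the `/` before comparing). Real locale collations are more involved, and the sample paths and the `contiguous?` helper are made up for illustration.

```ruby
# Hypothetical ancestry values: node 1 has descendants "1/2" and "1/2/3";
# "10" and "12" belong to unrelated nodes.
paths = ["10", "1/2/3", "12", "1", "1/2"]

# Byte-wise order, roughly what a binary / "C" collation gives.
bytewise = paths.sort_by(&:bytes)

# Simplified stand-in for a collation that ignores punctuation:
# compare with "/" removed, falling back to the raw string on ties.
loose = paths.sort_by { |p| [p.delete("/"), p] }

# True when every value starting with +prefix+ sits in one unbroken run,
# which is what an index range scan for LIKE 'prefix%' relies on.
def contiguous?(sorted, prefix)
  hits = sorted.each_index.select { |i| sorted[i].start_with?(prefix) }
  hits.each_cons(2).all? { |a, b| b == a + 1 }
end

puts "bytewise: #{bytewise.inspect} contiguous=#{contiguous?(bytewise, '1/')}"
puts "loose:    #{loose.inspect} contiguous=#{contiguous?(loose, '1/')}"
```

With byte-wise ordering the descendants of node 1 form one run in the sorted list. With the simplified punctuation-ignoring ordering, the unrelated value `12` lands between `1/2` and `1/2/3`, so a single range over the sorted values no longer covers exactly the descendants.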
@kshnurov You have anything to add? I already had this PR in the works and was happy when you mentioned binary string comparison and indexes in our other discussion. |
I agree binary should be the default and always used, otherwise you lose big on performance. What's the issue with postgres sorting? Binary just compares bytes. |
by default, people are using utf8 for the Probably need to document the case where people can migrate from utf8 collated column to an ascii column. |
@kbrock why are you mentioning postgres? Can you reproduce the issues you're talking about? |
@kbrock what's the rush with merging something that isn't complete or correct? |
Thank you for asking me to show my work. I was just focused on Postgres.
Postgres
Setting the column
#!/usr/bin/env ruby
# basic_benchmark.rb
require 'active_record'
require 'ancestry'

ActiveRecord::Base.establish_connection(ENV.fetch('DATABASE_URL') { "postgres://localhost/ancestry_benchmark" })
ActiveRecord::Migration.verbose = false

ActiveRecord::Schema.define do
  create_table :users, force: true do |t|
    # Only set a collation when COL is given (e.g. COL=C); otherwise use the database default.
    col_options = {}
    col_options[:collation] = ENV["COL"] if ENV["COL"].present?
    # Splat the hash so it is passed as keyword options (required on Ruby 3).
    t.string :ancestry, **col_options
    t.string :name
  end
  add_index :users, :ancestry
end
STDERR.puts "#created schema"

class User < ActiveRecord::Base ; has_ancestry ; end

# Recursively build a tree with `count` children per node, `depth` levels deep.
def create_tree(parent, name: 'Tree 1', count: 2, depth: 2)
  if parent.kind_of?(Class) # root
    root = parent.create(name: name)
    create_tree(root, name: name, count: count, depth: depth - 1)
  else
    count.times do |i|
      # parent_id: parent.id
      child = parent.children.create(name: "#{name}.#{i+1}")
      create_tree(child, name: child.name, count: count, depth: depth - 1) if depth > 1
    end
  end
end

create_tree(User, name: 'Tree 1', count: 8, depth: 4)
STDERR.puts "#created tree (#{User.count})"

root = User.roots.order(:id).last # root of tree
puts root.subtree.explain

createdb ancestry_benchmark
DATABASE_URL=postgres://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=postgres://localhost/ancestry_benchmark COL=C ruby basic_benchmark.rb
#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..24.18 rows=9 width=72)
Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
(2 rows)
#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
QUERY PLAN
---------------------------------------------------------------------------------------------------
Bitmap Heap Scan on users (cost=12.91..23.50 rows=9 width=72)
Recheck Cond: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
-> BitmapOr (cost=12.91..12.91 rows=9 width=0)
-> Bitmap Index Scan on index_users_on_ancestry (cost=0.00..4.32 rows=4 width=0)
Index Cond: (((ancestry)::text >= '1/'::text) AND ((ancestry)::text < '10'::text))
-> Bitmap Index Scan on index_users_on_ancestry (cost=0.00..4.31 rows=4 width=0)
Index Cond: ((ancestry)::text = '1'::text)
-> Bitmap Index Scan on users_pkey (cost=0.00..4.28 rows=1 width=0)
Index Cond: (id = 1)
(10 rows)
Mysql
Mysql is not as explicit on explaining the query, but we know that it is using the
While the tests are still passing, it looks like mysql is sending binary data for
It looks like setting the
|
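The index condition in the second Postgres plan above (`>= '1/'` and `< '10'`) follows from byte ordering: `0` is the byte immediately after `/` in ASCII, so every string starting with `1/` falls in that half-open range. Below is a minimal Ruby sketch of that prefix-to-range rewrite. The helper `prefix_range` is hypothetical (it is not part of ancestry or Postgres), and the sample paths are made up.

```ruby
# Turn a LIKE prefix into a half-open byte range [lower, upper).
# Assumes a non-empty prefix whose last byte is below 0xFF.
def prefix_range(prefix)
  bytes = prefix.bytes
  bytes[-1] += 1 # "/" (0x2F) becomes "0" (0x30)
  [prefix, bytes.pack("C*").force_encoding(prefix.encoding)]
end

lower, upper = prefix_range("1/")

# Hypothetical ancestry values, compared byte-wise by Ruby's String#<=>.
paths = ["1", "1/2", "1/2/3", "10", "12"]
matches = paths.select { |p| p >= lower && p < upper }

puts "range: #{lower.inspect} <= ancestry < #{upper.inspect}"
puts "matches: #{matches.inspect}"
```

The range only selects the right rows when the column is compared byte for byte, which is why the plan with the default collation could not use it.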
@kshnurov I misunderstood your comment and thought you said yes. If you notice the hack from the test that was removed, the default collation was affecting the way that Postgres sorted/compared the data. The Mac and RedHat versions of Postgres use a collation that takes symbols into consideration when comparing strings. Debian-based versions of Postgres have a non-standard collation. This collation is case insensitive and does not take symbols into consideration. It returns data in a different order. But the order of the returned values is just a symptom of the underlying problem. The database is doing extra processing that renders the index ineffective on relative comparisons with the string index. This can be fixed by any number of changes, including: column data type, character set, collation, and string comparison operators. I feel collation is the simplest, most supported, and probably the best way to get It seems simple for developers to understand. Unicode is complicated, so we just want to simplify it and do a bitwise comparison. The column only contains numbers and symbols, or possibly ASCII letters in the uuid case. Just doing a byte comparison works in this case. Now on the mysql front, maybe collation is not the correct answer. We either want to roll it back or possibly want to use |
@kbrock why not just use the |
@kshnurov I'm all ears here. I've seen people embed Using The only reservation I have is producing SQL that is not readable or is tricky to copy paste. If you think binary is the way to go, then it makes sense to change the test suite to specifically test with the preferred format. |
Of course it will work. I'm using |
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589 * Document updated grammar stefankroes#594 * Documented `update_strategy` stefankroes#588 * Fix: fixed has_parent? when non-default primary id stefankroes#585 * Documented column collation and testing stefankroes#601 stefankroes#607 * Added initializer with default_ancestry_format stefankroes#612 * ruby 3.2 support stefankroes#596
* Fix: materialized_path2 strategy stefankroes#597 * Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589 * Document updated grammar stefankroes#594 * Documented `update_strategy` stefankroes#588 * Fix: fixed has_parent? when non-default primary id stefankroes#585 * Documented column collation and testing stefankroes#601 stefankroes#607 * Added initializer with default_ancestry_format stefankroes#612 * ruby 3.2 support stefankroes#596
Add documentation and tests for users to set collation on the ancestry column.
We need to encourage users to use a binary/ASCII collation for the ancestry column, otherwise they are losing the main benefit of materialized path, specifically the fact that we use an index for ancestry LIKE '$ancestry/%'.
I'm hoping we can drop ENV["ANCESTRY_COLLATION"], which is there because Canonical ignores punctuation when sorting/comparing strings. Getting us to use the proper indexes means all systems will behave in a more consistent manner.