
Document and test with column collation #601

Merged: 2 commits into stefankroes:master from kbrock:latin on Feb 20, 2023

Conversation

@kbrock (Collaborator) commented Feb 16, 2023

Add documentation and tests for setting collation on the ancestry column.

We need to encourage users to use a binary/ascii collation for the ancestry column; otherwise they lose the main benefit of materialized path, namely that ancestry LIKE '$ancestry/%' can use an index.

I'm hoping we can drop ENV["ANCESTRY_COLLATION"], which exists because Canonical's default collation ignores punctuation when sorting/comparing strings. Getting everyone onto the proper indexes means all systems will behave in a more consistent manner.

Currently, we are using the database's default collation for the ancestry column.
In most Rails apps, this is some form of unicode locale.

This skips indexes for LIKE.
On Ubuntu systems, it also ignores slashes when sorting, which causes all sorts of problems.

This changes the tests to use a simple binary comparison: ascii, byte for byte.
This gets better results, uses indexes for LIKE, and is faster for all comparisons.

ENV["ANCESTRY_COLLATION"] should never be needed unless you are testing performance characteristics.
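For reference, a minimal migration sketch of what this looks like (table and class names are mine; collation names vary by adapter — "C" is a Postgres name, "binary" a MySQL one, so check your adapter's documentation):

```ruby
class AddAncestryToNodes < ActiveRecord::Migration[7.0]
  def change
    # bytewise collation so ancestry LIKE '1/%' can use the index
    add_column :nodes, :ancestry, :string, collation: "C" # MySQL: "binary"
    add_index :nodes, :ancestry
  end
end
```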
@kbrock (Collaborator, Author) commented Feb 16, 2023

@kshnurov You have anything to add?

I already had this PR in the works and was happy when you mentioned binary string comparison and indexes in our other discussion.

@kshnurov (Contributor) commented Feb 17, 2023

I agree binary should be the default and always used, otherwise you lose big on performance.

What's the issue with postgres sorting? Binary just compares bytes.

@kbrock kbrock merged commit bdb9143 into stefankroes:master Feb 20, 2023
@kbrock kbrock deleted the latin branch February 20, 2023 01:40
@kbrock (Collaborator, Author) commented Feb 20, 2023

By default, people are using utf8 for the ancestry column, so they are not using an index for LIKE. Using that index is the main draw of this whole format in the first place.

We probably need to document how to migrate from a utf8-collated column to an ascii one.

@kshnurov (Contributor) commented Feb 20, 2023

@kbrock why are you mentioning postgres? Can you reproduce the issues you're talking about?
Also, there's no need to use collation; there's a binary type: add_column :table, :ancestry, :binary

@kshnurov (Contributor) commented

@kbrock what's the rush with merging something that isn't complete or correct?

@kbrock (Collaborator, Author) commented Feb 20, 2023

Thank you for asking me to show my work. I was just focused on Postgres.

Postgres

Setting the column collation to C on the ancestry column turns a table scan into an index scan.
In other cases, I have seen multiple OR clauses force a table scan, but luckily that is not the case for this test.

#!/usr/bin/env ruby

# basic_benchmark.rb
require 'active_record'
require 'ancestry'

ActiveRecord::Base.establish_connection(ENV.fetch('DATABASE_URL') { "postgres://localhost/ancestry_benchmark" })

ActiveRecord::Migration.verbose = false
ActiveRecord::Schema.define do
  create_table :users, force: true do |t|
    col_options = {}
    col_options[:collation] = ENV["COL"] if ENV["COL"].present?
    # double-splat: a bare trailing hash is not converted to kwargs in Ruby 3
    t.string :ancestry, **col_options
    t.string :name
  end
  add_index :users, :ancestry
end
STDERR.puts "#created schema"

class User < ActiveRecord::Base ; has_ancestry ; end

def create_tree(parent, name: 'Tree 1', count: 2, depth: 2)
  if parent.kind_of?(Class) # root
    root = parent.create(name: name)
    create_tree(root, name: name, count: count, depth: depth - 1)
  else
    count.times do |i|
      # parent_id: parent.id
      child = parent.children.create(name: "#{name}.#{i+1}")
      create_tree(child, name: child.name, count: count, depth: depth - 1) if depth > 1
    end
  end
end

create_tree(User, name: 'Tree 1', count: 8, depth: 4)
STDERR.puts "#created tree (#{User.count})"

root = User.roots.order(:id).last # root of tree
puts root.subtree.explain

# shell commands to run the benchmark:
createdb ancestry_benchmark
DATABASE_URL=postgres://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=postgres://localhost/ancestry_benchmark COL=C ruby basic_benchmark.rb

First run (default collation):

#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..24.18 rows=9 width=72)
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
(2 rows)

Second run (COL=C):

#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=12.91..23.50 rows=9 width=72)
   Recheck Cond: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   ->  BitmapOr  (cost=12.91..12.91 rows=9 width=0)
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.32 rows=4 width=0)
               Index Cond: (((ancestry)::text >= '1/'::text) AND ((ancestry)::text < '10'::text))
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.31 rows=4 width=0)
               Index Cond: ((ancestry)::text = '1'::text)
         ->  Bitmap Index Scan on users_pkey  (cost=0.00..4.28 rows=1 width=0)
               Index Cond: (id = 1)
(10 rows)
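Note how the planner rewrites the LIKE '1/%' prefix into a byte range (>= '1/' AND < '10'). A small pure-Ruby sketch (sample paths are mine) shows why this rewrite is only valid when strings compare byte for byte, as they do under a C/binary collation:

```ruby
prefix = "1/"

# upper bound: bump the last byte of the prefix ('/' 0x2F -> '0' 0x30),
# mirroring the planner's rewrite in the EXPLAIN output above
upper = prefix.dup
upper[-1] = (upper[-1].ord + 1).chr
# upper == "10"

paths = ["1", "1/3", "1/9/2", "10", "2/1"]

like_matches  = paths.select { |p| p.start_with?(prefix) }
range_matches = paths.select { |p| p >= prefix && p < upper }

# Ruby's String#<=> is bytewise, like a binary collation,
# so the range scan finds exactly the LIKE matches
like_matches == range_matches  # => true
```

Under a locale collation the range comparison no longer tracks byte prefixes, which is why the planner cannot do this rewrite and falls back to a sequential scan.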

MySQL

MySQL's EXPLAIN is not as explicit, but we know it is using the primary index for the id and the index on ancestry. This is the case both before and after the changes.

While the tests still pass, it looks like MySQL is sending binary data for ancestry, so changing the column collation to binary is probably incorrect for MySQL.

Setting the character set to ASCII may be the best approach, but that is not supported by Rails and it would force developers to use schema.sql, which I would like to avoid.

mysql --host localhost --port 3306 -u root -e "CREATE SCHEMA IF NOT EXISTS ancestry_benchmark;"
DATABASE_URL=mysql2://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=mysql2://localhost/ancestry_benchmark COL=binary ruby basic_benchmark.rb

First run (default collation):

#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE '1/%' OR `users`.`ancestry` = '1' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)

Second run (COL=binary):

#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE x'312f25' OR `users`.`ancestry` = x'31' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)
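The hex literals in the second EXPLAIN (x'312f25', x'31') are just the original query strings as raw bytes. A quick Ruby check confirms this (the helper name is mine, for illustration):

```ruby
# decode a MySQL hex string literal like x'312f25' back to readable text
def decode_hex_literal(lit)
  lit[/\Ax'(\h+)'\z/, 1].scan(/../).map { |h| h.to_i(16).chr }.join
end

decode_hex_literal("x'312f25'")  # => "1/%"
decode_hex_literal("x'31'")      # => "1"
```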

@kbrock (Collaborator, Author) commented Feb 20, 2023

@kshnurov I misunderstood your comment and thought you said yes.

If you notice the hack that was removed from the tests, the default collation was affecting the way Postgres sorted/compared the data. Mac and Red Hat builds of Postgres use a collation that takes symbols into consideration when comparing strings. Debian-based builds use a non-standard collation that is case insensitive and ignores symbols, so it returns data in a different order.

But the order of the returned values is just a symptom of the underlying problem: the database is doing extra processing that renders the index ineffective for relative comparisons on the string column.

This can be fixed by any number of changes: column data type, character set, collation, or string comparison operators.

I feel collation is the simplest, most supported, and probably the best way to get LIKE 'a%' working again on Postgres.

It seems simple for developers to understand: unicode is complicated, so we simplify things and just do a bytewise comparison. The column only contains numbers and symbols (or possibly ascii letters in the uuid case), so a plain byte comparison works here.
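A rough simulation of the ordering difference (sample paths are mine; the punctuation-insensitive collation is approximated by stripping the delimiter before comparing — real glibc/ICU rules are more involved):

```ruby
paths = ["1/9", "10", "1/10", "1/2"]

# bytewise ("C"/binary) collation: '/' (0x2F) sorts before '0' (0x30),
# so all of node 1's descendants sort together, before node 10
byte_order = paths.sort
# => ["1/10", "1/2", "1/9", "10"]

# crude stand-in for a punctuation-insensitive locale collation:
# ignore the '/' delimiter entirely when comparing
locale_order = paths.sort_by { |p| p.delete("/") }
# => ["10", "1/10", "1/2", "1/9"]  -- node 1's subtree is no longer contiguous
```

When the subtree is not a contiguous byte range, a prefix LIKE cannot be answered with a single index range scan.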

Now on the MySQL front, maybe collation is not the correct answer. We either want to roll it back or possibly use latin1. The query plan doesn't make the comparison look that different. I have not used MySQL since Rails 1 or 2, so I hope you have an opinion on the best way forward here.

@kshnurov (Contributor) commented Feb 20, 2023

@kbrock why not just use the binary data type instead of collation? I've been using it with mysql for a while, works perfectly, can't see why it would be different with other DBs. It does exactly what we need - stores/compares raw bytes.

@kbrock (Collaborator, Author) commented Feb 20, 2023

@kshnurov I'm all ears here.

I've seen people embed ancestry into form fields or manipulate it manually in Ruby, e.g.: ancestry + '/%', "#{ancestry}/%", split('/').map { |r| find(r).name }.join('/'). Will this work with a binary field?

Using an ascii character set seems to make the most sense to me, but I guess I'm just not familiar with what a binary type buys us. The binary collation seemed strange to me as well. Using a binary data type does look very simple.

The only reservation I have is producing SQL that is not readable or is tricky to copy paste.

If you think binary is the way to go, then it makes sense to change the test suite to specifically test with the preferred format.

@kshnurov (Contributor) commented Feb 21, 2023

Of course it will work. binary is a byte string in both MySQL and Postgres.
It's still a string, and it looks and operates the same way:
(screenshot: console session showing a binary ancestry column used like a string)

I'm using t.binary "ancestry", limit: 3000, null: false because MySQL has an index limit of 3072 bytes; you'll get an error if you exceed that. I'm not very familiar with PG, but it looks like the default limit there is 2730 bytes.
Also, a limit <= 4095 is needed so Rails creates a varbinary column instead of a blob.
We can cap it at 2700; that's an ancestry depth of 300-600 with int primary keys and ~81 with uuids.
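A back-of-the-envelope check of those depth figures (my arithmetic, assuming one byte for the '/' delimiter per level):

```ruby
LIMIT = 2700

# uuids without dashes: 32 hex chars + 1 delimiter byte per level
LIMIT / (32 + 1)  # => 81

# int primary keys: roughly 4- to 8-digit ids + 1 delimiter byte
LIMIT / (8 + 1)   # => 300
LIMIT / (4 + 1)   # => 540
```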

kbrock added a commit to kbrock/ancestry that referenced this pull request Mar 3, 2023
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589
* Document updated grammar stefankroes#594
* Documented `update_strategy` stefankroes#588
* Fix: fixed has_parent? when non-default primary id stefankroes#585
* Documented column collation and testing stefankroes#601 stefankroes#607
* Added initializer with default_ancestry_format stefankroes#612
* ruby 3.2 support stefankroes#596
kbrock added a commit to kbrock/ancestry that referenced this pull request Mar 3, 2023
* Fix: materialized_path2 strategy stefankroes#597
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589
* Document updated grammar stefankroes#594
* Documented `update_strategy` stefankroes#588
* Fix: fixed has_parent? when non-default primary id stefankroes#585
* Documented column collation and testing stefankroes#601 stefankroes#607
* Added initializer with default_ancestry_format stefankroes#612
* ruby 3.2 support stefankroes#596
kbrock added a commit to kbrock/ancestry that referenced this pull request Mar 4, 2023
* Fix: materialized_path2 strategy stefankroes#597
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589
* Document updated grammar stefankroes#594
* Documented `update_strategy` stefankroes#588
* Fix: fixed has_parent? when non-default primary id stefankroes#585
* Documented column collation and testing stefankroes#601 stefankroes#607
* Added initializer with default_ancestry_format stefankroes#612
* ruby 3.2 support stefankroes#596