
Document and test with column collation #601

Merged: 2 commits into stefankroes:master from kbrock:latin on Feb 20, 2023

Conversation

@kbrock (Collaborator) commented Feb 16, 2023

Add documentation and tests for setting collation on the ancestry column.

We need to encourage users to use a binary/ascii collation for the ancestry column; otherwise they lose the main benefit of materialized path, namely that ancestry LIKE '$ancestry/%' can use an index.

I'm hoping we can drop ENV["ANCESTRY_COLLATION"], which exists because Canonical's default collation ignores punctuation when sorting/comparing strings. Getting everyone onto the proper indexes means all systems will behave in a more consistent manner.

Currently, we are using the database's default collation for the ancestry column.
In most Rails apps, this is some form of unicode locale.

This skips indexes for LIKE.
On Ubuntu systems, it also ignores slashes when sorting, which causes all sorts of problems.

This changes the tests to use a simple binary comparison: ascii, byte for byte.
This gets better results, uses indexes for LIKE, and is faster for all comparisons.

ENV["ANCESTRY_COLLATION"] should never be needed unless you are testing performance characteristics.
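For reference, a minimal migration sketch of what this looks like (table and class names are mine; collation names vary by adapter — "C" is a Postgres name, "binary" a MySQL one, so check your adapter's documentation):

```ruby
class AddAncestryToNodes < ActiveRecord::Migration[7.0]
  def change
    # bytewise collation so ancestry LIKE '1/%' can use the index
    add_column :nodes, :ancestry, :string, collation: "C" # MySQL: "binary"
    add_index :nodes, :ancestry
  end
end
```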
@kbrock (Collaborator, Author) commented Feb 16, 2023

@kshnurov You have anything to add?

I already had this PR in the works and was happy when you mentioned binary string comparison and indexes in our other discussion.

@kshnurov (Contributor) commented Feb 17, 2023

I agree binary should be the default and always used, otherwise you lose big on performance.

What's the issue with postgres sorting? Binary just compares bytes.

@kbrock kbrock merged commit bdb9143 into stefankroes:master Feb 20, 2023
@kbrock kbrock deleted the latin branch February 20, 2023 01:40
@kbrock (Collaborator, Author) commented Feb 20, 2023

By default, people are using utf8 for the ancestry column, so they are not using an index for LIKE. Using that index is the main draw of this whole format in the first place.

We probably need to document how to migrate from a utf8-collated column to an ascii one.

@kshnurov (Contributor) commented Feb 20, 2023

@kbrock why are you mentioning postgres? Can you reproduce the issues you're talking about?
Also, there's no need to use collation; there's a binary type: add_column :table, :ancestry, :binary

@kshnurov (Contributor) commented

@kbrock what's the rush with merging something that isn't complete or correct?

@kbrock (Collaborator, Author) commented Feb 20, 2023

Thank you for asking me to show my work. I was just focused on Postgres.

Postgres

Setting the column collation to C on the ancestry column turns a table scan into an index scan.
In other cases, I have seen multiple OR clauses force a table scan, but luckily that is not the case for this test.

#!/usr/bin/env ruby

# basic_benchmark.rb
require 'active_record'
require 'ancestry'

ActiveRecord::Base.establish_connection(ENV.fetch('DATABASE_URL') { "postgres://localhost/ancestry_benchmark" })

ActiveRecord::Migration.verbose = false
ActiveRecord::Schema.define do
  create_table :users, force: true do |t|
    col_options = {}
    col_options[:collation] = ENV["COL"] if ENV["COL"].present?
    # double-splat: a bare trailing hash is not converted to kwargs in Ruby 3
    t.string :ancestry, **col_options
    t.string :name
  end
  add_index :users, :ancestry
end
STDERR.puts "#created schema"

class User < ActiveRecord::Base ; has_ancestry ; end

def create_tree(parent, name: 'Tree 1', count: 2, depth: 2)
  if parent.kind_of?(Class) # root
    root = parent.create(name: name)
    create_tree(root, name: name, count: count, depth: depth - 1)
  else
    count.times do |i|
      # parent_id: parent.id
      child = parent.children.create(name: "#{name}.#{i+1}")
      create_tree(child, name: child.name, count: count, depth: depth - 1) if depth > 1
    end
  end
end

create_tree(User, name: 'Tree 1', count: 8, depth: 4)
STDERR.puts "#created tree (#{User.count})"

root = User.roots.order(:id).last # root of tree
puts root.subtree.explain

# shell commands to run the benchmark:
createdb ancestry_benchmark
DATABASE_URL=postgres://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=postgres://localhost/ancestry_benchmark COL=C ruby basic_benchmark.rb

First run (default collation):

#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..24.18 rows=9 width=72)
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
(2 rows)

Second run (COL=C):

#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=12.91..23.50 rows=9 width=72)
   Recheck Cond: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   ->  BitmapOr  (cost=12.91..12.91 rows=9 width=0)
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.32 rows=4 width=0)
               Index Cond: (((ancestry)::text >= '1/'::text) AND ((ancestry)::text < '10'::text))
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.31 rows=4 width=0)
               Index Cond: ((ancestry)::text = '1'::text)
         ->  Bitmap Index Scan on users_pkey  (cost=0.00..4.28 rows=1 width=0)
               Index Cond: (id = 1)
(10 rows)
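Note how the planner rewrites the LIKE '1/%' prefix into a byte range (>= '1/' AND < '10'). A small pure-Ruby sketch (sample paths are mine) shows why this rewrite is only valid when strings compare byte for byte, as they do under a C/binary collation:

```ruby
prefix = "1/"

# upper bound: bump the last byte of the prefix ('/' 0x2F -> '0' 0x30),
# mirroring the planner's rewrite in the EXPLAIN output above
upper = prefix.dup
upper[-1] = (upper[-1].ord + 1).chr
# upper == "10"

paths = ["1", "1/3", "1/9/2", "10", "2/1"]

like_matches  = paths.select { |p| p.start_with?(prefix) }
range_matches = paths.select { |p| p >= prefix && p < upper }

# Ruby's String#<=> is bytewise, like a binary collation,
# so the range scan finds exactly the LIKE matches
like_matches == range_matches  # => true
```

Under a locale collation the range comparison no longer tracks byte prefixes, which is why the planner cannot do this rewrite and falls back to a sequential scan.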

MySQL

MySQL's EXPLAIN is not as explicit, but we know it is using the primary index for the id and the index on ancestry. This is the case both before and after the changes.

While the tests still pass, it looks like MySQL is sending binary data for ancestry, so changing the column collation to binary is probably incorrect for MySQL.

Setting the character set to ASCII may be the best approach, but that is not supported by Rails and it would force developers to use schema.sql, which I would like to avoid.

mysql --host localhost --port 3306 -u root -e "CREATE SCHEMA IF NOT EXISTS ancestry_benchmark;"
DATABASE_URL=mysql2://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=mysql2://localhost/ancestry_benchmark COL=binary ruby basic_benchmark.rb

First run (default collation):

#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE '1/%' OR `users`.`ancestry` = '1' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)

Second run (COL=binary):

#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE x'312f25' OR `users`.`ancestry` = x'31' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)
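The hex literals in the second EXPLAIN (x'312f25', x'31') are just the original query strings as raw bytes. A quick Ruby check confirms this (the helper name is mine, for illustration):

```ruby
# decode a MySQL hex string literal like x'312f25' back to readable text
def decode_hex_literal(lit)
  lit[/\Ax'(\h+)'\z/, 1].scan(/../).map { |h| h.to_i(16).chr }.join
end

decode_hex_literal("x'312f25'")  # => "1/%"
decode_hex_literal("x'31'")      # => "1"
```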

@kbrock (Collaborator, Author) commented Feb 20, 2023

@kshnurov I misunderstood your comment and thought you said yes.

If you notice the hack that was removed from the tests, the default collation was affecting the way Postgres sorted/compared the data. Mac and Red Hat builds of Postgres use a collation that takes symbols into consideration when comparing strings. Debian-based builds use a non-standard collation that is case insensitive and ignores symbols, so it returns data in a different order.

But the order of the returned values is just a symptom of the underlying problem: the database is doing extra processing that renders the index ineffective for relative comparisons on the string column.

This can be fixed by any number of changes: column data type, character set, collation, or string comparison operators.

I feel collation is the simplest, most supported, and probably the best way to get LIKE 'a%' working again on Postgres.

It seems simple for developers to understand: unicode is complicated, so we simplify things and just do a bytewise comparison. The column only contains numbers and symbols (or possibly ascii letters in the uuid case), so a plain byte comparison works here.
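A rough simulation of the ordering difference (sample paths are mine; the punctuation-insensitive collation is approximated by stripping the delimiter before comparing — real glibc/ICU rules are more involved):

```ruby
paths = ["1/9", "10", "1/10", "1/2"]

# bytewise ("C"/binary) collation: '/' (0x2F) sorts before '0' (0x30),
# so all of node 1's descendants sort together, before node 10
byte_order = paths.sort
# => ["1/10", "1/2", "1/9", "10"]

# crude stand-in for a punctuation-insensitive locale collation:
# ignore the '/' delimiter entirely when comparing
locale_order = paths.sort_by { |p| p.delete("/") }
# => ["10", "1/10", "1/2", "1/9"]  -- node 1's subtree is no longer contiguous
```

When the subtree is not a contiguous byte range, a prefix LIKE cannot be answered with a single index range scan.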

Now on the MySQL front, maybe collation is not the correct answer. We either want to roll it back or possibly use latin1. The query plan doesn't make the comparison look that different. I have not used MySQL since Rails 1 or 2, so I hope you have an opinion on the best way forward here.

@kshnurov (Contributor) commented Feb 20, 2023

@kbrock why not just use the binary data type instead of collation? I've been using it with mysql for a while, works perfectly, can't see why it would be different with other DBs. It does exactly what we need - stores/compares raw bytes.

@kbrock (Collaborator, Author) commented Feb 20, 2023

@kshnurov I'm all ears here.

I've seen people embed ancestry into form fields or manipulate it manually in Ruby, e.g.: ancestry + '/%', "#{ancestry}/%", split('/').map { |r| find(r).name }.join('/'). Will this work with a binary field?

Using an ascii character set seems to make the most sense to me, but I guess I'm just not familiar with what a binary type buys us. The binary collation seemed strange to me as well. Using a binary data type does look very simple.

The only reservation I have is producing SQL that is not readable or is tricky to copy paste.

If you think binary is the way to go, then it makes sense to change the test suite to specifically test with the preferred format.

@kshnurov (Contributor) commented Feb 21, 2023

Of course it will work. binary is a byte string in both MySQL and Postgres.
It's still a string, and it looks and operates the same way:
(screenshot: console session showing a binary ancestry column used like a string)

I'm using t.binary "ancestry", limit: 3000, null: false because MySQL has an index limit of 3072 bytes; you'll get an error if you exceed that. I'm not very familiar with PG, but it looks like the default limit there is 2730 bytes.
Also, a limit <= 4095 is needed so Rails creates a varbinary column instead of a blob.
We can cap it at 2700; that's an ancestry depth of 300-600 with int primary keys and ~81 with uuids.
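A back-of-the-envelope check of those depth figures (my arithmetic, assuming one byte for the '/' delimiter per level):

```ruby
LIMIT = 2700

# uuids without dashes: 32 hex chars + 1 delimiter byte per level
LIMIT / (32 + 1)  # => 81

# int primary keys: roughly 4- to 8-digit ids + 1 delimiter byte
LIMIT / (8 + 1)   # => 300
LIMIT / (4 + 1)   # => 540
```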

kbrock added a commit to kbrock/ancestry that referenced this pull request Mar 3, 2023
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589
* Document updated grammar stefankroes#594
* Documented `update_strategy` stefankroes#588
* Fix: fixed has_parent? when non-default primary id stefankroes#585
* Documented column collation and testing stefankroes#601 stefankroes#607
* Added initializer with default_ancestry_format stefankroes#612
* ruby 3.2 support stefankroes#596
kbrock added a commit to kbrock/ancestry that referenced this pull request Mar 3, 2023
* Fix: materialized_path2 strategy stefankroes#597
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589
* Document updated grammar stefankroes#594
* Documented `update_strategy` stefankroes#588
* Fix: fixed has_parent? when non-default primary id stefankroes#585
* Documented column collation and testing stefankroes#601 stefankroes#607
* Added initializer with default_ancestry_format stefankroes#612
* ruby 3.2 support stefankroes#596
kbrock added a commit to kbrock/ancestry that referenced this pull request Mar 4, 2023
* Fix: materialized_path2 strategy stefankroes#597
* Fix: descendants ancestry is now updated in after_update callbacks stefankroes#589
* Document updated grammar stefankroes#594
* Documented `update_strategy` stefankroes#588
* Fix: fixed has_parent? when non-default primary id stefankroes#585
* Documented column collation and testing stefankroes#601 stefankroes#607
* Added initializer with default_ancestry_format stefankroes#612
* ruby 3.2 support stefankroes#596