Can't delete large file #1185
@tnatanael What you'd have to do to delete the file physically is run "compact-start", as described here: http://leo-project.net/leofs/docs/admin/system_operations/data/#how-to-operate-data-compaction. Please check the doc above for more details.
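A minimal sketch of that flow (the storage node name is an assumption; substitute the node names listed by leofs-adm status):

## Start data compaction on every object container of one storage node
## ('all' targets every container; a number limits how many are compacted)
$ leofs-adm compact-start storage_0@127.0.0.1 all

## Poll progress; compaction is finished when the state returns to 'idling'
$ leofs-adm compact-status storage_0@127.0.0.1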
Tried with leofs-adm compact-start; it says OK, but the file persists, even after waiting for the process to finish.
Let us know your LeoFS error log and the state of the large object:
How can I discover the object name? Is it the filename of the original file?
When I run compact, and when I try to delete the file using the S3 API, this error message pops up in the log:
Exactly: leofs-adm whereis
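For reference, a usage sketch: whereis takes the object path as the S3 client sees it, i.e. bucket plus key (the bucket and key here are made-up placeholders):

## Show which storage nodes hold the object's replicas, with checksums
$ leofs-adm whereis mybucket/videos/big.bin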
I tried with only the filename... and with bucket + filename. Both options say:
I am 100% sure that the file was corrupted due to disk failures, but it may need to be cleared, either by a manual delete or automatically by the cluster in some way.
I understand that your LeoFS RING (routing table) is broken, so let me know the current state of the system. Can you share the result of
OK... just to note, the cluster is still working; I am uploading and removing new files right now. Only this file is undeletable...
TO: @mocchira, your opinion will be much appreciated.
Hi guys! Can this ticket be labelled as a bug instead of a question?
Please do the following if you understand that we may NOT be able to restore your system completely. Procedure:
If the procedure succeeds, you can execute the
I'd like to share an example of the procedure for recovering LeoManager's RING below.

[Example] How To Recover LeoManager's RING

Before recovery:

$ leofs-adm status
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.5.0
cluster Id | leofs_1
DC Id | dc_1
Total replicas | 2
number of successes of R | 1
number of successes of W | 1
number of successes of D | 1
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash |
previous ring-hash |
-----------------------------------+----------
[State of Node(s)]
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
S | storage_0@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:10:33 +0900
S | storage_1@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:10:33 +0900
S | storage_2@127.0.0.1 | running | | d5d667a6 | d5d667a6 | 2019-05-23 10:10:33 +0900
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------

The Procedure of Recovering LeoManager's RING

1. Stop all the nodes

$ ./package/leo_manager_0/bin/leo_manager stop
ok
$ ./package/leo_manager_1/bin/leo_manager stop
ok
$ ./package/leo_gateway_0/bin/leo_gateway stop
ok
$ ./package/leo_storage_0/bin/leo_storage stop
ok
$ ./package/leo_storage_1/bin/leo_storage stop
ok
$ ./package/leo_storage_2/bin/leo_storage stop
ok

2. Archive LeoManager's directories

$ tar czf leo_manager_0_backup.tar.gz ./package/leo_manager_0/
$ tar czf leo_manager_1_backup.tar.gz ./package/leo_manager_1/
$ ls -la | grep backup.tar.gz
-rw-r--r-- 1 yosukehara staff 15435040 May 23 10:12 leo_manager_0_backup.tar.gz
-rw-r--r-- 1 yosukehara staff 15429047 May 23 10:12 leo_manager_1_backup.tar.gz

3. Remove LeoManager's data directories

## manager_0:
$ rm -rf ./package/leo_manager_0/work/mnesia/*
## manager_1:
$ rm -rf ./package/leo_manager_1/work/mnesia/*
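As a small precaution (an editor's addition, not part of the original comment), it may be worth confirming both mnesia directories are actually empty before restarting:

## Both listings should show no remaining mnesia files
$ ls -la ./package/leo_manager_0/work/mnesia/
$ ls -la ./package/leo_manager_1/work/mnesia/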
4. Restart all the nodes except LeoGateway's node(s)

$ ./package/leo_manager_0/bin/leo_manager start
$ ./package/leo_manager_1/bin/leo_manager start
$ ./package/leo_storage_0/bin/leo_storage start
$ ./package/leo_storage_1/bin/leo_storage start
$ ./package/leo_storage_2/bin/leo_storage start
$ leofs-adm status
[System Confiuration]
-----------------------------------+----------
Item | Value
-----------------------------------+----------
Basic/Consistency level
-----------------------------------+----------
system version | 1.5.0
cluster Id | leofs_1
DC Id | dc_1
Total replicas | 2
number of successes of R | 1
number of successes of W | 1
number of successes of D | 1
number of rack-awareness replicas | 0
ring size | 2^128
-----------------------------------+----------
Multi DC replication settings
-----------------------------------+----------
[mdcr] max number of joinable DCs | 2
[mdcr] total replicas per a DC | 1
[mdcr] number of successes of R | 1
[mdcr] number of successes of W | 1
[mdcr] number of successes of D | 1
-----------------------------------+----------
Manager RING hash
-----------------------------------+----------
current ring-hash |
previous ring-hash |
-----------------------------------+----------
[State of Node(s)]
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
type | node | state | rack id | current ring | prev ring | updated at
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------
S | storage_0@127.0.0.1 | attached | | | | 2019-05-23 10:14:00 +0900
S | storage_1@127.0.0.1 | attached | | | | 2019-05-23 10:14:03 +0900
S | storage_2@127.0.0.1 | attached | | | | 2019-05-23 10:14:05 +0900
-------+--------------------------+--------------+---------+----------------+----------------+----------------------------

After restarting all the nodes, execute
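Whichever command follows, one way to verify the recovery took effect is to re-run status and confirm every storage node has left the attached state and shares the same ring hashes:

## Expect state 'running' and identical current/prev ring-hash values
$ leofs-adm status | grep storage_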
I'll try this tomorrow and report back, but I am wondering why this happens. Is it expected behaviour? Thanks for now!
Sorry for the delay... it worked. After that procedure I was able to delete the file...
@yosukehara I tried to follow your instructions but ran into a problem. After I restarted all the LeoFS services, all users and buckets had disappeared. So I restored the mnesia folder on leo_manager and everything was back, but the RING was broken again. Can you suggest how to fix this problem? Thanks
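One possible way to avoid losing the accounts when wiping mnesia (a suggestion, not from the maintainers above) is to dump the users and buckets first and re-create them once the new RING is up; the user id, bucket name, and access key below are made-up placeholders:

## Before wiping mnesia: record the existing users and buckets
$ leofs-adm get-users > users_backup.txt
$ leofs-adm get-buckets > buckets_backup.txt

## After the new RING is in place: re-create them by hand
$ leofs-adm create-user my_user
$ leofs-adm add-bucket my_bucket 05236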
Hi guys, I created a simple cluster with 2 storage nodes, and after uploading a 1 GB file and running the cluster for 1 week, I am not able to delete this file; the delete operation completes successfully but the file persists...
What I tried (a usage sketch follows the list):
recover-node
recover-disk
recover-consistency
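For reference, a sketch of how two related recovery commands are invoked; the node name and object path are assumptions, and the exact arguments of recover-disk and recover-consistency should be checked against leofs-adm's help:

## Re-check and repair all replicas stored on one node
$ leofs-adm recover-node storage_0@127.0.0.1

## Repair the replicas of a single object, addressed by bucket/key
$ leofs-adm recover-file mybucket/videos/big.bin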
I worry that once I put the cluster into the production environment, with many more files, this would be a very annoying bug, so please help me.