Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Self-test failures when using non-default page/sector/block sizes. #77

Open
pjsg opened this issue Mar 15, 2016 · 11 comments
Open

Self-test failures when using non-default page/sector/block sizes. #77

pjsg opened this issue Mar 15, 2016 · 11 comments

Comments

@pjsg
Copy link
Contributor

pjsg commented Mar 15, 2016

I'm one of the nodemcu-firmware maintainers and we use spiffs as our filesystem. I wanted to upgrade to the latest version, but when I tried to run your self-tests with our page/sector/block parameters, I get self-test failures.

My diff against master is:

--- a/src/test/params_test.h
+++ b/src/test/params_test.h
@@ -20,9 +20,9 @@
 // test using filesystem magic length
 #define SPIFFS_USE_MAGIC_LENGTH   1

-#define SECTOR_SIZE         65536
-#define LOG_BLOCK           (SECTOR_SIZE*2)
-#define LOG_PAGE            (SECTOR_SIZE/256)
+#define SECTOR_SIZE         4096
+#define LOG_BLOCK           4096
+#define LOG_PAGE            256

 #define FD_BUF_SIZE     64*6
 #define CACHE_BUF_SIZE  (LOG_PAGE + 32)*8

After doing make clean all test, I get:

Test report, 71 tests
69 succeeded
2 failed
  long_run_config_many_medium
  long_run
0 stopped

I did try some other combinations of block/sector/page sizes, and I get other sets of tests failing. If I use the values from master, then all the tests pass.

Before I dig into this, is this something that you would expect with these values? Do you think it is more likely to be an issue with the test case?

Thanks

@pellepl
Copy link
Owner

pellepl commented Mar 15, 2016

Hi,

Yes you're most certainly safe. These testcases are pretty complicated and
depend much on the config, even though I made an attempt to keep them as
cfg-agnostic as possible.

Just to make sure, after each failed test the framework spits out "last
errno ". What error code do you get here, in both cases?
Den 15 mar 2016 13:36 skrev "Philip Gladstone" notifications@github.com:

I'm one of the nodemcu-firmware maintainers and we use spiffs as our
filesystem. I wanted to upgrade to the latest version, but when I tried to
run your self-tests with our page/sector/block parameters, I get self-test
failures.

My diff against master is:

--- a/src/test/params_test.h
+++ b/src/test/params_test.h
@@ -20,9 +20,9 @@
// test using filesystem magic length
#define SPIFFS_USE_MAGIC_LENGTH 1

-#define SECTOR_SIZE 65536
-#define LOG_BLOCK (SECTOR_SIZE*2)
-#define LOG_PAGE (SECTOR_SIZE/256)
+#define SECTOR_SIZE 4096
+#define LOG_BLOCK 4096
+#define LOG_PAGE 256

#define FD_BUF_SIZE 64_6
#define CACHE_BUF_SIZE (LOG_PAGE + 32)_8

After doing make clean all test, I get:

Test report, 71 tests
69 succeeded
2 failed
long_run_config_many_medium
long_run
0 stopped

I did try some other combinations of block/sector/page sizes, and I get
other sets of tests failing. If I use the values from master, then all the
tests pass.

Before I dig into this, is this something that you would expect with these
values? Do you think it is more likely to be an issue with the test case?

Thanks


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#77

@pjsg
Copy link
Contributor Author

pjsg commented Mar 15, 2016

I get -10001 (ERR_FULL). However, these come out of the fs consistency check.

Test case result for just '-f long_run'

test-run.txt

@pjsg
Copy link
Contributor Author

pjsg commented Mar 15, 2016

I did a bunch more runs with different combinations:

All sector sizes from 4096 - 65536, block = sector and block = sector * 2, and page between 128 and 1024.

Every single combination fails at least one test (except for 65536, 65536 * 2, 256). The output s=12 means sector_size=2^12, block=0 is block_size = sector_size * 2^0, page=10 is page_size=2^10

all-res/s=12,block=0,page=10:9 failed
all-res/s=12,block=0,page=7:6 failed
all-res/s=12,block=0,page=8:2 failed
all-res/s=12,block=0,page=9:1 failed
all-res/s=12,block=1,page=10:2 failed
all-res/s=12,block=1,page=7:64 failed
all-res/s=12,block=1,page=8:1 failed
all-res/s=12,block=1,page=9:1 failed
all-res/s=13,block=0,page=10:2 failed
all-res/s=13,block=0,page=7:64 failed
all-res/s=13,block=0,page=8:1 failed
all-res/s=13,block=0,page=9:1 failed
all-res/s=13,block=1,page=10:2 failed
all-res/s=13,block=1,page=7:6 failed
all-res/s=13,block=1,page=8:1 failed
all-res/s=13,block=1,page=9:1 failed
all-res/s=14,block=0,page=10:2 failed
all-res/s=14,block=0,page=7:6 failed
all-res/s=14,block=0,page=8:1 failed
all-res/s=14,block=0,page=9:1 failed
all-res/s=14,block=1,page=10:2 failed
all-res/s=14,block=1,page=7:5 failed
all-res/s=14,block=1,page=8:60 failed
all-res/s=14,block=1,page=9:1 failed
all-res/s=15,block=0,page=10:2 failed
all-res/s=15,block=0,page=7:5 failed
all-res/s=15,block=0,page=8:60 failed
all-res/s=15,block=0,page=9:1 failed
all-res/s=15,block=1,page=10:2 failed
all-res/s=15,block=1,page=7:5 failed
all-res/s=15,block=1,page=8:1 failed
all-res/s=15,block=1,page=9:1 failed
all-res/s=16,block=0,page=10:2 failed
all-res/s=16,block=0,page=7:5 failed
all-res/s=16,block=0,page=8:1 failed
all-res/s=16,block=0,page=9:1 failed
all-res/s=16,block=1,page=10:2 failed
all-res/s=16,block=1,page=7:4 failed
all-res/s=16,block=1,page=8:0 failed
all-res/s=16,block=1,page=9:60 failed

The various error codes from the various runs:

all-res/s=12,block=0,page=10: spiffs errno:-10001
all-res/s=12,block=0,page=10: spiffs errno:-10072
all-res/s=12,block=0,page=7: spiffs errno:0
all-res/s=12,block=0,page=7: spiffs errno:-10001
all-res/s=12,block=0,page=7: spiffs errno:-10007
all-res/s=12,block=0,page=7: spiffs errno:-10019
all-res/s=12,block=0,page=8: spiffs errno:-10001
all-res/s=12,block=0,page=9: spiffs errno:-10001
all-res/s=12,block=1,page=10: spiffs errno:-10001
all-res/s=12,block=1,page=10: spiffs errno:-10002
all-res/s=12,block=1,page=7: spiffs errno:0
all-res/s=12,block=1,page=7: spiffs errno:-10007
all-res/s=12,block=1,page=7: spiffs errno:-10019
all-res/s=12,block=1,page=7: spiffs errno:-10024
all-res/s=12,block=1,page=8: spiffs errno:-10001
all-res/s=12,block=1,page=9: spiffs errno:-10001
all-res/s=13,block=0,page=10: spiffs errno:-10001
all-res/s=13,block=0,page=10: spiffs errno:-10002
all-res/s=13,block=0,page=7: spiffs errno:0
all-res/s=13,block=0,page=7: spiffs errno:-10007
all-res/s=13,block=0,page=7: spiffs errno:-10019
all-res/s=13,block=0,page=7: spiffs errno:-10024
all-res/s=13,block=0,page=8: spiffs errno:-10001
all-res/s=13,block=0,page=9: spiffs errno:-10001
all-res/s=13,block=1,page=10: spiffs errno:-10001
all-res/s=13,block=1,page=10: spiffs errno:-10002
all-res/s=13,block=1,page=7: spiffs errno:0
all-res/s=13,block=1,page=7: spiffs errno:-10001
all-res/s=13,block=1,page=7: spiffs errno:-10007
all-res/s=13,block=1,page=7: spiffs errno:-10019
all-res/s=13,block=1,page=8: spiffs errno:-10001
all-res/s=13,block=1,page=9: spiffs errno:-10001
all-res/s=14,block=0,page=10: spiffs errno:-10001
all-res/s=14,block=0,page=10: spiffs errno:-10002
all-res/s=14,block=0,page=7: spiffs errno:0
all-res/s=14,block=0,page=7: spiffs errno:-10001
all-res/s=14,block=0,page=7: spiffs errno:-10007
all-res/s=14,block=0,page=7: spiffs errno:-10019
all-res/s=14,block=0,page=8: spiffs errno:-10001
all-res/s=14,block=0,page=9: spiffs errno:-10001
all-res/s=14,block=1,page=10: spiffs errno:-10001
all-res/s=14,block=1,page=10: spiffs errno:-10002
all-res/s=14,block=1,page=7: spiffs errno:0
all-res/s=14,block=1,page=7: spiffs errno:-10001
all-res/s=14,block=1,page=7: spiffs errno:-10007
all-res/s=14,block=1,page=7: spiffs errno:-10019
all-res/s=14,block=1,page=8: spiffs errno:-10024
all-res/s=14,block=1,page=9: spiffs errno:-10001
all-res/s=15,block=0,page=10: spiffs errno:-10001
all-res/s=15,block=0,page=10: spiffs errno:-10002
all-res/s=15,block=0,page=7: spiffs errno:0
all-res/s=15,block=0,page=7: spiffs errno:-10001
all-res/s=15,block=0,page=7: spiffs errno:-10007
all-res/s=15,block=0,page=7: spiffs errno:-10019
all-res/s=15,block=0,page=8: spiffs errno:-10024
all-res/s=15,block=0,page=9: spiffs errno:-10001
all-res/s=15,block=1,page=10: spiffs errno:-10001
all-res/s=15,block=1,page=10: spiffs errno:-10002
all-res/s=15,block=1,page=7: spiffs errno:0
all-res/s=15,block=1,page=7: spiffs errno:-10001
all-res/s=15,block=1,page=7: spiffs errno:-10007
all-res/s=15,block=1,page=7: spiffs errno:-10019
all-res/s=15,block=1,page=8: spiffs errno:-10001
all-res/s=15,block=1,page=9: spiffs errno:-10001
all-res/s=16,block=0,page=10: spiffs errno:-10001
all-res/s=16,block=0,page=10: spiffs errno:-10002
all-res/s=16,block=0,page=7: spiffs errno:0
all-res/s=16,block=0,page=7: spiffs errno:-10001
all-res/s=16,block=0,page=7: spiffs errno:-10007
all-res/s=16,block=0,page=7: spiffs errno:-10019
all-res/s=16,block=0,page=8: spiffs errno:-10001
all-res/s=16,block=0,page=9: spiffs errno:-10001
all-res/s=16,block=1,page=10: spiffs errno:-10001
all-res/s=16,block=1,page=10: spiffs errno:-10002
all-res/s=16,block=1,page=7: spiffs errno:0
all-res/s=16,block=1,page=7: spiffs errno:-10007
all-res/s=16,block=1,page=7: spiffs errno:-10019
all-res/s=16,block=1,page=9: spiffs errno:-10024

@pellepl
Copy link
Owner

pellepl commented Mar 15, 2016

I'm not surprised. It is very hard finding test case params for suitable for all possible combinations of configs. E.g. the nodemcu config is something i wouldn't dreamed off. 😉
Making pasta for the kids now, will look into it this evening.

Cheers

@pellepl
Copy link
Owner

pellepl commented Mar 15, 2016

Okidoki, let's begin with the error codes:

errors -10001 (SPIFFS_ERR_FULL) - most probably the metadata took too much space for the testcase to finish properly. Smaller page sizes gives more metadata and less file data.

errors -10002 (SPIFFS_ERR_NOT_FOUND) - same as above, the metadata probably filled the system too much to create a necessary file.

errors -10007 (SPIFFS_ERR_OUT_OF_FILE_DESCS) - increased page size, but fd buffer not increased accordingly, so the testcase simply ran out of them.

errors -10019 (SPIFFS_ERR_INDEX_LU) - this is an interesting one. Indicates a borked filesystem. I am not sure why we see this. What testcases/configs did you get this from?

errors -10024 (SPIFFS_ERR_NOT_CONFIGURED) - this a follow up error from the filesystem being built with a config that does not allow magic.

error -10072 (SPIFFS_VIS_END = SPIFFS_ERR_INTERNAL - 22) - actually not a real error, but shouldn't show up. Could have something to do with the check being run after each test.

So why the heck didn't I implement testcases that takes all possible configs? Well, because it is darn complicated.

Depending on pagesize and blocksize, you get different amount of metadata, and different amount of effective data in the fs. Yes, this can be calculated and used in the testcases, but that would definitely make the cases pretty unreadable.

Also depending on page- and blocksizes, some config switches simply won't work since there isn't one spare bit to store stuff (e.g. magic). E.g. all-res/s=16,block=1,page=9 with config magic off makes all testcases pass where 60 fails with magic on. Calculating beforehand what switches work with what configs would produce a pretty massive matrix full of maths.

Of course it is possible to do all this, but as of now, I simply do not have the time to write that kind of rigid test framework. If someone pays me, I might (but just might) :)

As for your original question - the test framework is stupid in a (or many) way(s). It outputs
fs consistency check:
and then prints output from the check callback. If you don't see anything specifically related to checks here, it means the check was successful. I.e. no output is good.
Then follows the test result, which in your case was FAILED, which makes all people on the globe but me think the consistency check failed. But it didn't.

So, the original problem is you getting SPIFFS_ERR_FULL. Tracking it down by setting last parameter in https://github.com/pellepl/spiffs/blob/master/src/test/test_hydrogen.c#L1638 and https://github.com/pellepl/spiffs/blob/master/src/test/test_hydrogen.c#L1817 to non-zero, it turns out that both tests fail when suddenly trying to create a large file.

Different config, more metadata, less data than expected in the testcases - it won't fit. Bam.

You're as safe as anyone running spiffs when all tests pass (da-da-daaaah).

As an epilogue to all this rambling, if I were a NodeMCU maintainer, I would consider trying LOG_BLOCK being 4*4096. At least you lose a broken-but-not-broken-for-real testcase :). It should also improve speed, but if you go that way, please test thoroughly. I don't know the size of NodeMCUs regular systems, for e.g. a 64k filesystem the original setting is a lot better.

@pjsg
Copy link
Contributor Author

pjsg commented Mar 15, 2016

Nodemcu has two filesystem sizes -- a really small one in the range 32k to maybe 100k (this is when there is a 512k flash chip) and about a 3.5Mb file system when we have a 4MB cache. I guess that different parameters may be appropriate for each size.

I'm going to dig into these errors more tonight and see if I can futz with the tests to get the nodemcu configurations to pass..... In particular, we run without MAGIC so I'll focus on those.

Thanks for your comments above.

The error -10019 all happen when the page size is set to 128 (and almost any config of block and sector). This may be too small to do anything decent. May be worth rejecting this size.....

@pellepl
Copy link
Owner

pellepl commented Mar 16, 2016

Ok, thanks for the input. Yes, having different configs for different sizes of flash is a good idea, but not mandatory of course.

Good luck with the futzing, yell if you need help or find something.

I'll check the 128 byte size here, I've been running a 128 byte page size spiffs on an ESP for some time now. It is odd, when defining test pagesize to128 some tests fail that specifically set the pagesize to 256. I smell a bug somewhere...

@pjsg
Copy link
Contributor Author

pjsg commented Mar 16, 2016

The test SECTOR=4096, BLOCK=4096, PAGE=1024, MAGIC=0 fails in test write_small_files_chunks_1 with

failed size, 0 != 512

I managed to get most of the tests to pass by reducing the size of the LARGE file to 1/12 of the flash size (from 1/3 of the flash size). In some cases, I can get away with the LARGE file at 1/6 of flash if I increase the FD_BUF_SIZE.

@pjsg
Copy link
Contributor Author

pjsg commented Mar 16, 2016

I now understand the problem -- but the solution is not immediately obvious.....

It turns out that when we want to write the LARGE file (as part of the test), essentially the entire file system consists of deleted pages. The GC is being asked for around 350k, and after five tries, it has only freed up 5 blocks. There are plenty of other blocks that could be freed, and the GC knows that they are there, but the logic only frees blocks once per "try". This is why, presumably, things work better with large block sizes.

I could increase the number of tries for the GC (quite a lot), but that limit is probably there for a reason. The GC might be smarter and be able to erase more blocks in a single pass. I'm not sure what the right approach is.

BLOCK=SECTOR=4096 PAGE=256 MAGIC=0 FD_BUF_SIZE=1024

Failing test: long_run_config_many_medium

@pellepl
Copy link
Owner

pellepl commented Mar 16, 2016

Ah yes, I tracked it down to the same. Good finding! Your reasoning is correct above.

Two things: first, when writing the test cases I had 64k blocks in mind. Secondly, this influenced the cap to being 5, as a 64k block takes more time to erase than a 4k block, and I assumed that ~300k would be a "very large" file in an embedded system, if I remember correctly.

The cap is there for safety. If there is a bug in the GC algo making it go into an eternal loop, we would wear the flash hard and never return. This is still unseen though.

With this in mind, you could crank it up to something more valid for a 2 megabyte, 4k block sized system. I'd say 256 perhaps would be a better figure. It all depends how the system is used.

I also took a review of the test bench and found some ugly bugs. Fix will come up in an hour or two. This fix will probably make you're exhaustive list look a lot better ;)

Also had some thinking on the nodemcu config. With 4096 bytes blocks and 256 bytes pages, you get 16 pages per block. In each block, one page is stolen for the lookup table. Cranking up the block size to the double would give better usage of the spi flash I think.

EDIT: Oh, and I am sorry to say, no - the GC is not smarter. Actually, in this sense it might be a bit stupid as it sometimes actually frees full block, yielding no free pages.

The reason is that spiffs honor wear leveling before freeing space. Say you create a large file spanning many blocks. Then you create and delete a lot of smaller files during a longer period. All this time, the large file will hog its blocks, which will lead to more wear on the other blocks. Instead, spiffs recognizes this by cleaning blocks not being cleaned for a long time. The assumption is that by then, the file system will be a bit more fragmented, which will break up the huge hog of a file over the system, making wear leveling fair.

@pjsg
Copy link
Contributor Author

pjsg commented Mar 17, 2016

Thanks for the test changes. I re-ran my battery of tests, and now 51 pass cleanly and 39 fail.

I have put my code in pjsg/spiffs:tests -- there is a perl script at the top level which will run a battery of 90 configurations and put the results in all-gcf subdirectory.

grep failed * | grep -e ":[0-9]" | grep -v ":0"

then reveals all the failed configurations.

For example, with configuration: Sector=4096, Block=8192, Page=1024, Magic=0

TEST 5/68 : running test page_cons3
    create and write file
   check: PAGE   FIX INDEX 8001:0000
   check: PAGE   DELETE BAD FILE 8001
   check: PAGE   DELETE PAGE 0002
   check: PAGE   DELETE PAGE 0003
   check: PAGE   DELETE PAGE 0004
  read_and_verify: could not open file file
  TEST FAIL src/test/test_check.c:213

It looks as though all the page=1024 failed. A number failed with -10024 with magic=1. I suspect that this is due to an invalid combination of page/block/sector sizes.

There are a few other failures that may be related to large blocks and small pages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants