Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange issue with filesystem - null pointer de-reference in SPIFFS_check() #93

Closed
dismirlian opened this issue Jun 10, 2016 · 11 comments
Closed

Comments

@dismirlian
Copy link

dismirlian commented Jun 10, 2016

Hi pellepl, first of all, thank you for sharing this great software.

I'm having trouble with SPIFFS. I have a 4MB flash memory (S25FL132K), half of which (2MB) is assigned to SPIFFS. When I want to create a file using SPIFFS_open(&fs, "d43.fw", SPIFFS_TRUNC | SPIFFS_CREAT | SPIFFS_WRONLY, 0), SPIFFS_open fails with -10002 (SPIFFS_ERR_NOT_FOUND).

Moreover, running SPIFFS_check results in a null-pointer de-reference in the function spiffs_object_index_consistency_check_v, specifically in the line obj_table[*log_ix] = obj_id & ~SPIFFS_OBJ_ID_IX_FLAG (there are two of those lines; the one that triggers the error is the one below the comment // found, register as reachable).

If I close the program and re-run, SPIFFS_check succeeds, but the filesystem is in a weird state (I can see there's a file, but the filename is wrong).

I can't tell you if there has been a power failure which caused this behaviour.

One question: how can I know if I need to run SPIFFS_check()? This function is slow, and I don't want to run it if I don't have to.

You can download the 2MB filesystem dump from this link.

Thanks again!

Regards,
Diego.

PS: my config below.

define SPIFFS_CACHE 1

define SPIFFS_CACHE_WR 1

define SPIFFS_CACHE_STATS 1

define SPIFFS_PAGE_CHECK 1

define SPIFFS_GC_MAX_RUNS 5

define SPIFFS_GC_STATS 1

define SPIFFS_GC_HEUR_W_DELET (5)

define SPIFFS_GC_HEUR_W_USED (-1)

define SPIFFS_GC_HEUR_W_ERASE_AGE (50)

define SPIFFS_OBJ_NAME_LEN (32)

define SPIFFS_COPY_BUFFER_STACK (64)

define SPIFFS_USE_MAGIC (1)

define SPIFFS_USE_MAGIC_LENGTH (0)

define SPIFFS_SINGLETON 1

define SPIFFS_CFG_PHYS_SZ(ignore) (2048*1024)

define SPIFFS_CFG_PHYS_ERASE_SZ(ignore) (4096)

define SPIFFS_CFG_PHYS_ADDR(ignore) (0)

define SPIFFS_CFG_LOG_PAGE_SZ(ignore) (256)

define SPIFFS_CFG_LOG_BLOCK_SZ(ignore) (4096)

define SPIFFS_ALIGNED_OBJECT_INDEX_TABLES 1

define SPIFFS_FILEHDL_OFFSET 0

define SPIFFS_READ_ONLY 0

typedef u16_t spiffs_block_ix;
typedef u16_t spiffs_page_ix;
typedef u16_t spiffs_obj_id;
typedef u16_t spiffs_span_ix;

@pellepl
Copy link
Owner

pellepl commented Jun 10, 2016

Hi Diego,

thanks for a detailed report. I'll look into it as soon as I can. When you rerun your app, what is the bad filename? Just garbage characters?

As for when to run SPIFFS_check, there are some thoughts on this on the wiki. I myself am using power backupped ram for denoting if I had a reset within a spiffs operation.

Cheers / Peter

@pellepl
Copy link
Owner

pellepl commented Jun 11, 2016

Had a quick peek on the image - it looks weird. What target is this, and what compiler are you using?

@dismirlian
Copy link
Author

Hi Peter, thanks for looking into this.

When I rerun the app, the filename is just garbage. Thanks for the pointer to the wiki, I had missed that. Probably I can do the same as you are doing.

The target is STM32F4 (ARM Cortex M4F) and the compiler is GCC. However, I also compile the app using MSVC under Windows, and works well in both platforms. The weird filesystem was generated by a STM32 device, and I dumped the entire flash.

Diego.

@pellepl
Copy link
Owner

pellepl commented Jun 13, 2016

Hi Diego,
findings so far:

First of all, the typedefs for spiffs_page_ix, spiffs_obj_id, spiffs_span_ix seem to be u32_t instead of u16_t given the image. This might explain why you get a nullpointer dereference, though I need to investigate that zero deref further.

Secondly, from the image, it looks like spiffs got aborted while writing a file td43J.fw - seems it got as long as roughly 32k.

Now, the scary part is that running a check on this will not succeed to mend the file. The check is a beast, and I'd need to dig further down to find out what is happening.

(I actually have some plans to rewrite the check altogether, but that will take some time and would lead to a new fs structure, so backwards compatibility would be lost. Different story, but still.)

I'll let you know as soon as I have more info.

Cheers / Peter

@dismirlian
Copy link
Author

Hi Peter,

Thanks for the update. You are right!! (about spiffs_page_ix, spiffs_obj_id, spiffs_span_ix being 32 bits). In the spiffs_config.h file, I defined:

typedef int32_t s32_t;
typedef uint32_t u32_t;
typedef int16_t s16_t;
typedef uint32_t u16_t;
typedef int8_t s8_t;
typedef uint8_t u8_t;

This bug has been in my source code for more than a year, and everything worked as expected! Even if I correct this bug (I typedef u16_t as uint16_t, and re-define spiffs_block_ix, spiffs_page_ix, spiffs_obj_id and spiffs_span_ix as u32_t, to maintain compatibility with the image), the null pointer de-reference still exists...

it looks like spiffs got aborted while writing a file td43J.fw - seems it got as long as roughly 32k.

This is a good hint, I'll try to see if I can find something in my software...

Thanks!
Diego.

@pellepl
Copy link
Owner

pellepl commented Jul 28, 2016

Hi again Diego, sorry for long delay.

The filesystem you have there is pretty borked to say the least. I still haven't been able to back track what could have happened with any certainty.

However, some stuff I revealed at least. Getting SPIFFS_ERR_NOT_FOUND when issuing SPIFFS_open with TRUNC and CREATE is of course bad. This happens because a garbage collection commences which will try to move a file that is inconsistent (have been totally broken). I need to address this in the gc, so future versions of gc will keep going even if it stumbles on a broken file.

And as you say, the CHECK mends thing in a bad way. I think this is because the check have two pages with same id, and cannot decide which one is correct. In this case, it chooses the bad one. I need to check this more.

I cannot reproduce the nullpointer dereference. Could this have something to do with different endianess on different targets (PC/ARM)?

I'll dig further.

Cheers / Peter

pellepl added a commit that referenced this issue Jul 31, 2016
@dismirlian
Copy link
Author

Hi Peter, how are you? No problem, it's not like I've paid for commercial
support ;).

At least the gc doesn't crash in the event of a bad filesystem; in this
case SPIFFS_open only returns an unexpected result code.

I'll check the null pointer de-reference in ARM. The endianess is the same
on both targets (little), maybe some alignment is different, or maybe it
has to do with the fact that I messed up with the definition of u16_t?

I hadn't checked on ARM because the broken filesystem has been dumped from
a remote device, to which I don't have access (it's 1200km away...). I'll
write the filesystem to a development device, to test on ARM, and report
back.

Thanks again!
Diego.

On Thu, Jul 28, 2016 at 5:19 AM, Peter Andersson notifications@github.com
wrote:

Hi again Diego, sorry for long delay.

The filesystem you have there is pretty borked to say the least. I still
haven't been able to back track what could have happened with any certainty.

However, some stuff I revealed at least. Getting SPIFFS_ERR_NOT_FOUND when
issuing SPIFFS_open with TRUNC and CREATE is of course bad. This happens
because a garbage collection commences which will try to move a file that
is inconsistent (have been totally broken). I need to address this in the
gc, so future versions of gc will keep going even if it stumbles on a
broken file.

And as you say, the CHECK mends thing in a bad way. I think this is
because the check have two pages with same id, and cannot decide which one
is correct. In this case, it chooses the bad one. I need to check this more.

I cannot reproduce the nullpointer dereference. Could this have something
to do with different endianess on different targets (PC/ARM)?

I'll dig further.

Cheers / Peter


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#93 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AQXw4yVJDhFPZlQyKSwLF8cPuxeIEy7yks5qaGX7gaJpZM4IzMpK
.

@pellepl
Copy link
Owner

pellepl commented Aug 1, 2016

Hi Diego, pretty fine thank you, apart from minor anxiety that my vacation is soon to an end ;)

Regarding the null deref - I've actually been trying to reproduce it on PC (using the test framework) all along without success. Are you simply running SPIFFS_check on the image you gave me before? Might it be another spiffs version?

I've fixed the gc in commit 81fb3fa, making a bit more robust. It failed since it found a data page
not being referenced by any index page, which in turn yielded the ERR_NOT_FOUND. Now, if the gc finds such pages, it simply deletes them. As they are not referenced, they do not belong to any file. This is most probably because a powerloss when writing a lengthy file.

@pellepl
Copy link
Owner

pellepl commented Aug 7, 2016

Hi again,
Well - I've been looking thru almost every bit now. I've found parts of written data that "never will happen".
My only explanation is

  1. bad spiflash - have you seen this on other targets?
  2. flakey spi bus - perhaps too high bus speed?
  3. stack corrption - seen other crashes on target?
  4. mutex issues - running threaded environment?

Covering the sequences I've seen would bloat the code significantly and be a major pain in the ****. To be fully honest with you, I'm a bit reluctant to do this. Have you encountered similar problems on other targets, or is this a single device failure?

@dismirlian
Copy link
Author

Hi Peter,

Thanks for your effort. This has been so far a single device failure, so...

  1. I haven't seen this on other units.
  2. The SPI bus' speed is definitely not too high, and the signal integrity
    seems fine...
  3. No I haven't seen that, but I cannot rule it out... However, I have the
    RTOS' "stack corruption detection" feature on.
  4. Yes, I'm running a multi-threaded environment, but the filesystem is
    accessed from only one of them...

As you say, this is very weird, and it's not worth the bloat and the amount
of work!

Of course, I'm watching the devices closely, and there haven't been other
issues.

Thanks again for your effort, and sharing your software!

Regards,
Diego.

On Sun, Aug 7, 2016 at 10:34 AM, Peter Andersson notifications@github.com
wrote:

Hi again,
Well - I've been looking thru almost every bit now. I've found parts of
written data that "never will happen".
My only explanation is

  1. bad spiflash - have you seen this on other targets?
  2. flakey spi bus - perhaps too high bus speed?
  3. stack corrption - seen other crashes on target?
  4. mutex issues - running threaded environment?

Covering the sequences I've seen would bloat the code significantly and be
a major pain in the ****. To be fully honest with you, I'm a bit reluctant
to do this. Have you encountered similar problems on other targets, or is
this a single device failure?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#93 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AQXw47fX5m1fH6C4C1a0uLD61N_o6SQ7ks5qdd70gaJpZM4IzMpK
.

@pellepl
Copy link
Owner

pellepl commented Aug 7, 2016

Phew! :) Might just be a bad spiflash, solar winds or an act of god then :)

Thanks for the kind words! I'll close this issue, but if you see it again please reopen and we'll have another run.

As mentioned before, I have some plans to do a full rewrite of the check parts.

@pellepl pellepl closed this as completed Aug 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants