
[entropy_src,verilator] Reduce entropy tests requirements on Verilator to fix failing test targets #24528

Open: wants to merge 2 commits into master

Conversation

AlexJones0 (Contributor):

This PR partially addresses #24184; specifically, it addresses the two tests that were found to fail if given long enough to run in Verilator.

Entropy is generated much more slowly in Verilator, and as a result two tests were timing out (and, if given long enough, failing) when run in Verilator: sw/device/tests:entropy_src_fw_observe_many_contiguous_test_sim_verilator and sw/device/tests:entropy_src_fw_override_test_sim_verilator. This PR conditionally lowers the test requirements for the Verilator simulation environment so that these tests pass on Verilator.

For fw_observe_many_contiguous_test, the number of entropy samples to be observed is reduced from 1024 to just 8, as observing the full 1024 would take around 13-15 hours. For fw_override_test, the entropy consumers are limited to random number generation only, and the test terminates after that point on Verilator. The timeout given to observe this entropy is also increased 30x on Verilator.
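
A rough sketch of the shape of this conditional is shown below; the constant and helper names are illustrative rather than the PR's actual identifiers, while kDeviceType and kDeviceSimVerilator are the existing symbols from sw/device/lib/arch/device.h that also appear in the diff further down.

  #include <stddef.h>

  #include "sw/device/lib/arch/device.h"

  enum {
    // Illustrative values matching the description above.
    kContiguousSamplesDefault = 1024,
    kContiguousSamplesVerilator = 8,
  };

  // Hypothetical helper: entropy is produced far more slowly in Verilator,
  // so observe only a handful of contiguous samples there.
  static size_t contiguous_sample_count(void) {
    return (kDeviceType == kDeviceSimVerilator) ? kContiguousSamplesVerilator
                                                : kContiguousSamplesDefault;
  }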

Note that these code changes do not modify the operation of any execution environment other than Verilator; only Verilator has had conditional changes. If this is deemed inappropriate given the requirements of these tests, then the Verilator environment should instead be removed from these two tests.

This has been tested by running the following command:

./bazelisk.sh test -t- --test_output=streamed --test_timeout=3600,3600,3600,3600 \
    //sw/device/tests:entropy_src_fw_observe_many_contiguous_test_sim_verilator \
    //sw/device/tests:entropy_src_fw_override_test_sim_verilator

which now passes as expected:

//sw/device/tests:entropy_src_fw_observe_many_contiguous_test_sim_verilator PASSED in 2807.4s
//sw/device/tests:entropy_src_fw_override_test_sim_verilator             PASSED in 3358.1s

Reduces the number of contiguous entropy samples observed when running
on Verilator to just 8 (down from 1024). Observing the full 1024 would
require greatly increasing the timeout and would make the test take
around 13-15 hours to run. Without increasing the timeout, the test
currently fails because the required entropy rate is not met.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
@AlexJones0 AlexJones0 requested a review from a team as a code owner September 5, 2024 17:20
@AlexJones0 AlexJones0 requested review from HU90m, pamaury and jwnrt and removed request for a team and HU90m September 5, 2024 17:20
@pamaury (Contributor) left a comment:

I would say that the main purpose of this test is to run on the FPGA or a real device. Running this test in Verilator only makes sense for debugging, in which case it does not necessarily matter if it takes many hours to run.
Therefore it's not clear to me if this test should just be marked as manual for Verilator instead. I think @vogelpi should have a look at this since he designed the testplan for that.

@AlexJones0 (Contributor, Author):

That makes sense, thanks. I agree that we should wait for @vogelpi's opinion on this in that case.

Therefore it's not clear to me if this test should just be marked as manual for Verilator instead.

One small correction: if we do not want to make these changes, then marking the tests as manual for Verilator will not be sufficient, because when manually executed these tests will fail with either Internal:["ENT",176] in the case of entropy_src_fw_observe_many_contiguous_test, or DeadlineExceeded:["ENT",367] in the case of entropy_src_fw_override_test. To retain e.g. 1024 contiguous samples, we would still need to modify the test to conditionally increase the timeout on Verilator, or we could just remove the Verilator execution environments entirely. It otherwise doesn't make sense to me to have a test that is marked as supporting Verilator which you can run manually, but which is actually guaranteed to fail.

@jwnrt (Contributor) commented Sep 6, 2024:

I think we should apply this change and mark manual. @pamaury is right that it’s not useful in Verilator, but if you really really have to use it to debug something then it should at least work.

@AlexJones0 (Contributor, Author):

I think we should apply this change and mark manual.

I agree, that sounds sensible. Regardless of whether this change is applied, it makes sense to mark the Verilator target as manual if the environment stays around, because, based on what has been described so far, the Verilator test target only really makes sense for debugging purposes.

@vogelpi (Contributor) left a comment:

Thanks for looping me in @AlexJones0 and @pamaury!

I've left some comments on how to change the fw_override_test in a way that does not reduce the scope of the test too much but still speeds it up for Verilator. Would you mind giving this a try please?

You are right: this test is mainly useful for SiVal purposes (non-simulation platforms), but having it run in Verilator is very useful for debugging. So re-enabling this is valuable, and the approach taken here is valid for the Verilator simulation environment. Thanks @AlexJones0 for doing this work!

Comment on lines 76 to 82
kRandomNumberTimeoutUsec = 100 * 1000,
/**
* Timeout to generate a random number in micro seconds when running in a
* Verilator simulation environment. Verilator observes entropy more slowly
* than other environments, and so is given a longer timeout.
*/
kVerilatorRandomNumberTimeoutUsec = 3 * 1000 * 1000,
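
A minimal sketch of how these two constants might be selected at runtime, assuming the test keys off kDeviceType from sw/device/lib/arch/device.h; the helper name below is hypothetical rather than the PR's actual code.

  #include <stdint.h>

  #include "sw/device/lib/arch/device.h"

  enum {
    // Values copied from the snippet above.
    kRandomNumberTimeoutUsec = 100 * 1000,
    kVerilatorRandomNumberTimeoutUsec = 3 * 1000 * 1000,
  };

  // Hypothetical helper: use the longer timeout only under Verilator, where
  // the untuned AST noise source delivers entropy much more slowly.
  static uint32_t random_number_timeout_usec(void) {
    return (kDeviceType == kDeviceSimVerilator)
               ? kVerilatorRandomNumberTimeoutUsec
               : kRandomNumberTimeoutUsec;
  }
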
Contributor:

There is a factor of 30x difference here, which strongly suggests that we probably never cared about tuning the rate of the raw noise source (inside AST) for the Verilator simulation. For context, to speed up the Verilator simulation, we use different clock frequencies and ratios as well as baud rates for Verilator. See e.g. sw/device/lib/arch/device_sim_verilator.c.

If you compare these values with the values for FPGA (for which this test was initially designed), you'll note a factor of 48x or similar difference. The RTL model used on FPGA and in Verilator is the same though. I don't remember exactly at which clock the AST model is running. I think what we should really do is tune the AST model for the Verilator simulation. Would you mind creating a GitHub issue for this please?

@AlexJones0 (Contributor, Author):

That makes sense, thanks - I've made that issue here: #24585

Please do mention if there's anything that I've missed capturing in that issue. I had a quick look at the RTL of the ADC as well as the device clock/ratio definitions for the Verilator and FPGA environments, and I suspect that, as you say, it is this 48x difference in clock speeds that necessitates the increase in timeout.

Contributor:

Thanks for creating the issue!

Comment on lines 395 to 402
if (kDeviceType == kDeviceSimVerilator) {
// If running on Verilator then entropy is generated much more slowly.
// It would take an impractical amount of time to generate the thousands
// of words of entropy that are required by all entropy consumers, and so
// since we are not concerned with the rate of entropy on Verilator, we
// stop at just generating some random numbers only.
return OK_STATUS();
}
Contributor:

I think this is non-ideal because it means we skip quite a big portion of the main test function in Verilator. So the Verilator simulation becomes less useful for debugging as a big part of the design is not exercised anymore.

Another way to speed up the Verilator simulation would be to reduce the number of modes tested. E.g. for Verilator only we could change:

  // Test all modes.
  static dif_entropy_src_single_bit_mode_t kModes[] = {
      kDifEntropySrcSingleBitModeDisabled, kDifEntropySrcSingleBitMode0,
      kDifEntropySrcSingleBitMode1,        kDifEntropySrcSingleBitMode2,
      kDifEntropySrcSingleBitMode3,
  };

to

  // Test all modes.
  static dif_entropy_src_single_bit_mode_t kModes[] = {
      kDifEntropySrcSingleBitModeDisabled, kDifEntropySrcSingleBitMode0,
  };

This would just check 2 instead of 5 modes (...Mode0 to ...Mode3 are almost identical, and all of them are 4x slower compared to ...ModeDisabled).

Also, the final phase:

  // Rerun the test with single bit mode disabled,
  // this time with an output delay.
  fw_ov_insert_wait_enabled = true;
  EXECUTE_TEST(test_result, firmware_override_extract_insert,
               kDifEntropySrcSingleBitModeDisabled, false, false);
  EXECUTE_TEST(test_result, firmware_override_extract_insert,
               kDifEntropySrcSingleBitModeDisabled, true, false);

Could be skipped for Verilator. We don't learn a lot from this.
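
One way such a Verilator-only reduction could be expressed is sketched below, under the assumption that the test iterates over the first N entries of kModes; the helper name is hypothetical, and kDeviceType / kDeviceSimVerilator are the existing device-library symbols.

  #include <stddef.h>

  #include "sw/device/lib/arch/device.h"
  #include "sw/device/lib/dif/dif_entropy_src.h"

  static dif_entropy_src_single_bit_mode_t kModes[] = {
      kDifEntropySrcSingleBitModeDisabled, kDifEntropySrcSingleBitMode0,
      kDifEntropySrcSingleBitMode1,        kDifEntropySrcSingleBitMode2,
      kDifEntropySrcSingleBitMode3,
  };

  // Hypothetical helper: exercise only the two representative modes under
  // Verilator, and all five everywhere else.
  static size_t num_modes_to_test(void) {
    return (kDeviceType == kDeviceSimVerilator)
               ? 2
               : sizeof(kModes) / sizeof(kModes[0]);
  }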

@AlexJones0 (Contributor, Author):

Yes, I agree; this was one thing that I was not sure about.

Making your suggested changes (just testing the first two modes and skipping the final phase on Verilator), alongside the same timeout changes as before, causes the test to pass in 6656.2 seconds (around 1 hour 50 minutes) in an environment similar to CI. It is not clear to me whether properly tuning the AST model as suggested would change this.

This seems fine to me for debug purposes - it is far better than the 15+ hours it was taking before. Ideally we would like it to run in under an hour, but given the purpose of this test I think it makes more sense to tag the Verilator environment as manual for entropy_src_fw_override_test with a comment explaining the runtime, alongside these changes. Does this sound reasonable to you @vogelpi?

Contributor:

Thanks for trying this out and reporting back. Tuning the AST model would help on top of this but we don't have to do this right now.

This seems fine to me for debug purposes - it is far better than the 15+ hours it was taking before. Ideally we would like it to run in under an hour, but given the purpose of this test I think it makes more sense to tag the Verilator environment as manual for entropy_src_fw_override_test with a comment explaining the runtime, alongside these changes. Does this sound reasonable to you @vogelpi?

This would be perfect. Can you implement these changes directly in this PR?

@vogelpi (Contributor) left a comment:

LGTM, thanks @AlexJones0!

This commit modifies the `entropy_src_fw_override_test` such that when
running on Verilator, the random number consumer and the AES test utils
are both given 48x larger timeouts, to account for the fact that the
AST's noise source has not been tuned for a Verilator execution
environment. This stops these specific tests from failing on Verilator.

Additionally, the requirements of the test have been conditionally
reduced when executing in a Verilator simulation environment. The final
phase of testing is skipped (single bit mode disabled, with an output
delay) as it provides negligible useful information when running on
Verilator, and only the `kDifEntropySrcSingleBitModeDisabled` and
`kDifEntropySrcSingleBitMode0` modes are used to run the tests, as
modes 1 through 3 are almost identical and all are much slower than
ModeDisabled. This provides the speedup necessary for the test to
succeed (given a reasonable timeout) in the Verilator sim environment.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
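
For illustration only (the commit's actual diff may differ), the conditional skip of the final phase could look roughly like the fragment below, sitting inside the test's main function and reusing the names from the snippet quoted in the review above.

  // Sketch, not the verbatim diff: rerun with an output delay only on
  // platforms other than Verilator, where this extra phase adds little value.
  if (kDeviceType != kDeviceSimVerilator) {
    // Rerun the test with single bit mode disabled, this time with an
    // output delay.
    fw_ov_insert_wait_enabled = true;
    EXECUTE_TEST(test_result, firmware_override_extract_insert,
                 kDifEntropySrcSingleBitModeDisabled, false, false);
    EXECUTE_TEST(test_result, firmware_override_extract_insert,
                 kDifEntropySrcSingleBitModeDisabled, true, false);
  }
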
@AlexJones0 (Contributor, Author):

I forgot to mark the Verilator environment for the fw_override test as manual in that push so I've just added that now - apologies!
