Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Windows, Space is incorrectly given 1970 start time #906

Closed
philrz opened this issue Jun 13, 2020 · 6 comments · Fixed by #938
Closed

On Windows, Space is incorrectly given 1970 start time #906

philrz opened this issue Jun 13, 2020 · 6 comments · Fixed by #938
Assignees
Labels
bug Something isn't working
Milestone

Comments

@philrz
Copy link
Contributor

philrz commented Jun 13, 2020

I'm pretty sure I've seen this issue in the past, but I've only now come up with a formula for reproducing it with some reliability. This repro was in GA Brim tagged v0.11.0 talking to zqd tagged v0.15.0. I've only been able to reproduce it on Windows. However, because it doesn't repro 100% of the time, it makes me uncertain if it's Windows-specific in a code-centric way or if it's just down to timing, i.e. somehow a side-effect of things running a bit slower on Windows.

I've found I can reliably reproduce this with the (uncompressed) pcap at https://archive.wrccdc.org/pcaps/2018/wrccdc.2018-03-23.010014000000000.pcap.gz. The trick seems to be dragging the pcap into the first tab, then quickly dragging another pcap into one or more other tabs before bouncing back to the first tab. I've previously repro'ed it by dragging the wrccdc pcap into the first tab and a different pcap into a second tab, but since it repros with the same wrccdc pcap used over & over, I've stuck with that here for simplicity.

As you can see in the attached video, what happens is that the start time for the created Space incorrectly shows a start time of 1970. We've seen this happen when time-free Zeek events have been generated that get assigned a timestamp of ts=0, but that's not what's going on here, evidenced by the fact that:

  1. If you let the pcap load on its own in a single tab without disturbing it (not shown in the video, but you can do this on your own), you never get any such events.

  2. Even with the 1970 start times shown on all four tabs as in the video, once they were all done loading, I re-dragged one of their all.zng files into yet another tab and the Start time in the picker did not show 1970, indicating no such ts=0 events were found.

In the video I show an error that seems to pop up in Dev Tools. I also show in the video that the info.json created by zqd shows a min_time of 0, so this seems to absolve the Brim app from any wrongdoing.

Once again, I'll caution that this is not reproducible 100% of the time, so don't be surprised if you get it wrong on the first try. Since it took me a while to come up with this recipe and I respect that others may struggle to get it right, I'm game to do more repros to help with debug if people tell me what they need done.

Repro.mp4.zip

@mikesbrown
Copy link
Contributor

This can happen in CI conditions too, e.g., https://github.com/brimsec/brim/runs/777954531 . It manifests itself with the histogram test failing because it's not rendered properly. Screen shot shows a histogram starting in 1970.

I unpacked the artifacts and found the Brim directory and space directory for the offending space (the histogram test). You can put its contents into the brim.git/run directory and run npm run electron to see the problem. Also,

$ jq . run/data/spaces/sp_1dPrjZaiZdAaduXZauBV5rKBoyn/info.json
{
  "min_time": {
    "sec": 0,
    "ns": 0
  },
  "max_time": {
    "sec": 1582646597,
    "ns": 838527001
  }
}

906.zip if you want it.

@philrz
Copy link
Contributor Author

philrz commented Jun 16, 2020

Thanks @mikesbrown... that's surely good info to know. As long as we're piling on evidence, I'll share something else CI-related I mentioned to @mattnibs and @alfred-landrum offline. At one point when I trying lots of variations to see how reliably I could reproduce the "1970" symptom, I managed to trigger an "Access is Denied" message similar to the one described in #613. It only happened that one time, and maybe it was a coincidence and has a separate root cause. Or... maybe we're lucky & they have the same root cause. I know @alfred-landrum has been unable to repro #613 when he's added extra logging in branches. So maybe we'll fix this issue and find the other will go away as well. 🤞

@philrz
Copy link
Contributor Author

philrz commented Jun 22, 2020

Now that #613 has a fix, I created a current Windows artifact to see if this issue was chased away, but alas, it's still with us. A note-to-self or others who may try to repro in the future: It seems like deleting the appState.json each time I close & re-launch Brim (i.e. simulating a user coming to the app "fresh" for the very first time) is helpful in getting a more reliable repro.

@mikesbrown
Copy link
Contributor

@alfred-landrum
Copy link
Contributor

The root cause here is:

  • when a pcap is ingested, we first set the time range for the space using pcap index data, before we actually ingest any records
  • when we begin feeding the pcap data to zeek, there can be a brief moment where there are zeek log files (dns.log, etc) present, but empty
  • when we read the empty files, we then tried to update the space time range by joining it with the existing time (from the pcap) with a zero time (because no records were read)

brim-bot pushed a commit to brimdata/zui that referenced this issue Jun 25, 2020
… no records" by alfred-landrum

This is an auto-generated commit with a zq dependency update. The zq PR
brimdata/super#938, authored by @alfred-landrum,
has been merged.

dont update the filestore span from a rewrite with no records

Closes brimdata/super#906
@philrz
Copy link
Contributor Author

philrz commented Jun 25, 2020

Verified on Brim commit 26332a9 talking to zqd commit d04c83e.

With this class of bug that has an unreliable repro, it's always a little tough to know for sure that we've chased it away. However, the approach of deleting appState.json before each launch as I described in my last comment seemed to give a reliable repro, so here with a newly-created Windows artifact I did five repro attempts in a row using that approach and never saw the symptom again. So between that and having faith in the analysis that led to the fix, I think we're good. 😄

Thanks @alfred-landrum!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants