
getting Null as container name #925

Closed
nebi-frame opened this issue Nov 14, 2019 · 32 comments · Fixed by #1099

@nebi-frame commented Nov 14, 2019

What happened:
Sometimes we get null as container.name, and as the parent process name, from Falco.
What you expected to happen:
A real value
How to reproduce it (as minimally and precisely as possible):
Listen for any event and look at what you get for container.name
Anything else we need to know?:
We raised this issue on the Slack channel and some members suggested making sure Falco starts after Docker, so we added sleep statements. Unfortunately, the problem still occurs.
Environment:

  • Falco version (use falco --version): 0.17.1 - 0.18.0
  • System info
  • Cloud provider or hardware configuration: AWS ECS AMI - ami-084f07d75acedcefa
  • OS (e.g: cat /etc/os-release): Amazon Linux 2, ECS
  • Install tools (e.g. in kubernetes, rpm, deb, from source):
curl -o /tmp/install-falco.sh -s https://s3.amazonaws.com/download.draios.com/stable/install-falco
sudo bash /tmp/install-falco.sh

sudo mv $TEMPLATE_DIR/falco/falco.yaml /etc/falco/falco.yaml
sudo mv $TEMPLATE_DIR/falco/falco_rules.yaml /etc/falco/falco_rules.yaml

# Setup logrotate to run every 5mins
sudo mv $TEMPLATE_DIR/falco/logrotate /etc/logrotate.d/falco.conf
sudo chown root /etc/logrotate.d/falco.conf
echo "*/5 * * * * /usr/sbin/logrotate -f /etc/logrotate.d/falco.conf" | sudo crontab -
  • Others:
@fntlnz (Contributor) commented Nov 27, 2019

/milestone 0.19.0

poiana added this to the 0.19.0 milestone, Nov 27, 2019
@fntlnz (Contributor) commented Dec 3, 2019

/assign @fntlnz

@fntlnz (Contributor) commented Dec 3, 2019

/assign @leodido

@fntlnz (Contributor) commented Dec 4, 2019

From repo planning: @mfdii suggests that this has something to do with how we cache metadata.

@fntlnz (Contributor) commented Dec 5, 2019

I confirm that I'm able to reproduce this

docker run --rm  -it debian:jessie apt update
Dec 05 13:56:36 ip-172-31-20-33.ec2.internal falco[30945]: 13:56:36.280134044: Error Package management process launched in container (user=root command=apt update container_id=eb04a288be40 container_name=<NA> image=<NA>:<NA>)

@fntlnz (Contributor) commented Dec 5, 2019

After a very long bisect session I've been able to identify that this was introduced after this PR in sysdig: draios/sysdig#1326

@fntlnz (Contributor) commented Dec 5, 2019

The problem is that container metadata is now fetched asynchronously, so when Falco detects the event there is a chance that the container metadata has not been fetched yet. That is why we sometimes see the metadata and sometimes not.
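The race is easy to model outside of Falco. Below is a minimal sketch (toy code, not the actual sinsp implementation): the metadata fetch runs on a background thread while the event path does a non-blocking cache read, so an event that arrives before the fetch completes sees the <NA> placeholder.

```python
import threading
import time

# Toy model of an asynchronous container metadata cache (illustrative
# only, not the real sinsp code). The fetch runs on a background thread;
# a lookup that happens before it completes returns the "<NA>" placeholder.
class AsyncContainerCache:
    def __init__(self):
        self._names = {}
        self._lock = threading.Lock()

    def start_fetch(self, container_id, real_name, delay=0.05):
        def fetch():
            time.sleep(delay)  # simulate a slow container-runtime API call
            with self._lock:
                self._names[container_id] = real_name
        t = threading.Thread(target=fetch)
        t.start()
        return t

    def name(self, container_id):
        # Non-blocking read, like formatting %container.name for an event.
        with self._lock:
            return self._names.get(container_id, "<NA>")

cache = AsyncContainerCache()
fetcher = cache.start_fetch("eb04a288be40", "debian-jessie")

early = cache.name("eb04a288be40")  # event fires before the fetch finishes
fetcher.join()
late = cache.name("eb04a288be40")   # event fires after the fetch finishes

print("early event: container_name=%s" % early)
print("late event:  container_name=%s" % late)
```

The first print shows the placeholder, the second the real name — exactly the intermittent behavior reported above.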

@leodido (Member) commented Dec 20, 2019

We are still working on this.

I don't feel like we have room for improvement in this part until we introduce a processing queue (which is planned for when we do the gRPC input API - #908).

We could solve this by adding some kind of synchronization mechanism, like a mutex, but on most systems that would increase the drop rate a lot. So the plan is to postpone this until the gRPC input interface is done and we have the syscall input with metadata fetching in place.

/milestone 1.0.0

@nebi-frame (Author)

Hey guys,
Do we have any timeline for the 1.0.0 release?
Thanks.

@krisnova (Contributor) commented Jan 6, 2020

Hey @nebi-frame,

We haven't planned a date, but we can bring it up on the Wed call. Do you think you will be able to attend so we can talk more about this?

@nebi-frame (Author) commented Jan 7, 2020 via email

@holdenk commented Jan 23, 2020

Is this a thing where folks are looking for help?

@markjacksonfishing (Contributor)

Yes!!! and welcome @holdenk

@krisnova (Contributor)

Yes - we have an end user who is asking for a fix here - I am sure they would appreciate anyone looking at this. We can pay with stickers, t-shirts, hugs, and coffee.

@holdenk commented Jan 23, 2020

Awesome, if you've got some time I'll bug you with some detail questions as I start to get lost in a new code base :)

@holdenk commented Jan 27, 2020

From a quick chat: in falco.cpp, look at where we call next and add mutexes around there. And then look for where the container metadata is coming from, idk? I think I missed something here @kris-nova.

@krisnova (Contributor)

So check out the main daemon loop in falco.cpp and look at how we call next(&ev) with the event pointer.

The mutex will need to lock right after this method call, and also wait for the "container metadata" context that comes from Lua here

We need to make it such that Falco can't go into deadlock if for some reason Lua returns successful != true

We also need this mutex to be an opt-in feature that is exposed via a feature flag. Probably something like --wait-for-container-context. All of the feature flags are defined in falco.cpp here.
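One way to sketch that idea (every name below is hypothetical, including the flag, since none of this exists in Falco yet): block after pulling the event until the container context is published, but bound the wait with a timeout so a metadata fetch that never completes cannot deadlock the main loop.

```python
import threading

# Hypothetical sketch of the opt-in wait described above; the class and
# flag names are illustrative, not Falco's actual API. After pulling the
# next event we optionally wait for the container context, but never
# longer than a timeout, so a fetch that fails (e.g. Lua returning
# successful != true) cannot deadlock the main loop.
class ContainerContext:
    def __init__(self):
        self._ready = threading.Event()
        self.name = "<NA>"

    def publish(self, name):
        self.name = name
        self._ready.set()

    def wait_ready(self, timeout):
        # True if metadata arrived within the timeout, False otherwise.
        return self._ready.wait(timeout)

def format_event(ctx, wait_for_container_context=False, timeout=0.1):
    if wait_for_container_context:
        # On timeout we fall through with the placeholder instead of
        # blocking forever -- this is the anti-deadlock guarantee.
        ctx.wait_ready(timeout)
    return "container_name=%s" % ctx.name

# Metadata never arrives: the loop still makes progress.
stuck = ContainerContext()
print(format_event(stuck, wait_for_container_context=True))

# Metadata is published before the event: the wait returns immediately.
ok = ContainerContext()
ok.publish("debian-jessie")
print(format_event(ok, wait_for_container_context=True))
```

The timeout is the design choice that keeps this opt-in safe: the worst case degrades to today's behavior (a placeholder name) rather than a hung main loop.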

@fntlnz (Contributor) commented Jan 27, 2020

@kris-nova @holdenk maybe we don't need a mutex here. The asynchronous nature of the container metadata fetch is what is causing the problem, so one thing we can do is add a flag that tells the container metadata fetch not to be asynchronous.
It doesn't solve the underlying problem for those who want to keep the fetch asynchronous, but it's something we can do now for those who need container metadata 100% of the time while we wait for the Input API.

Asynchronous metadata fetches can be deactivated by calling set_cri_async(false); on the sinsp class.

In our case, it can be done in falco.cpp
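A toy model of what that switch changes (only the method name set_cri_async comes from the real sinsp class; everything else below is an illustrative stand-in): with async fetching disabled, the lookup blocks until the container runtime answers, so the event always carries the real name, at the cost of stalling event processing during the fetch.

```python
import time

# Illustrative stand-in for the behaviour toggled by set_cri_async()
# (only that method name comes from the real sinsp class; the rest is
# a toy model). With async fetching disabled, the lookup blocks until
# the container runtime answers, so the event always carries the real
# name, at the cost of stalling event processing during the fetch.
class ToyInspector:
    def __init__(self):
        self._cri_async = True

    def set_cri_async(self, enabled):
        self._cri_async = enabled

    def container_name(self, fetch_delay=0.01):
        if self._cri_async:
            return "<NA>"        # fetch still in flight: placeholder
        time.sleep(fetch_delay)  # block until the runtime responds
        return "debian-jessie"

inspector = ToyInspector()
print(inspector.container_name())  # async: fetch not finished yet

inspector.set_cri_async(False)     # opt into synchronous fetching
print(inspector.container_name())  # sync: fetched before returning
```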

Wdyt?

@srivastavaabhinav

@fntlnz what's the decision here?

@holdenk commented Feb 13, 2020

@fntlnz would making this call be sync slow things down unacceptably?

@leodido (Member) commented Feb 13, 2020

@holdenk IMHO yes.

Also, that would increase the drop rate noticeably.

A temporary solution could be to introduce a flag for Falco that forces the inspector - i.e., sinsp - (technically, its container manager, class sinsp_container_manager) to fetch the container metadata synchronously.

To do it, there is this method: inspector->set_cri_async(), source (here).
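Hypothetical wiring for such a flag (the flag name --sync-container-metadata and the FakeInspector class below are made up for illustration; only set_cri_async corresponds to a real sinsp method):

```python
import argparse

# Hypothetical command-line wiring for the flag described above. The
# flag name --sync-container-metadata and FakeInspector are made up
# for illustration; only set_cri_async mirrors a real sinsp method.
def parse_args(argv):
    parser = argparse.ArgumentParser(prog="falco-sketch")
    parser.add_argument("--sync-container-metadata", action="store_true",
                        help="fetch container metadata synchronously")
    return parser.parse_args(argv)

class FakeInspector:
    def __init__(self):
        self.cri_async = True  # the default: asynchronous fetches

    def set_cri_async(self, enabled):
        self.cri_async = enabled

args = parse_args(["--sync-container-metadata"])
inspector = FakeInspector()
if args.sync_container_metadata:
    inspector.set_cri_async(False)  # opt-in: force synchronous fetches

print("cri_async =", inspector.cri_async)
```

Making it opt-in keeps the default behavior (and drop rate) unchanged for everyone who doesn't set the flag.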

Thanks for looking into this 🤗

@fntlnz (Contributor) commented Feb 19, 2020

Discussing this in today's call to make a decision

@leodido (Member) commented Feb 19, 2020

/assign @fntlnz

@nebi-frame (Author)

Can I join the call as well?

@leodido (Member) commented Feb 19, 2020

Sure, https://sysdig.zoom.us/my/falco

@fntlnz (Contributor) commented Feb 19, 2020

We made a decision: we are going to fix this by making the synchronous mode opt-in for those who want it.

This way we can fix the problem for everyone right now, and we can focus on the inputs API to solve the root cause, as Leo mentioned here.

@nebi-frame (Author)

Thanks everyone, I'm excited for the synchronous mode 👍

@leodido (Member) commented Feb 20, 2020

/milestone 0.21.0

poiana modified the milestones: 1.0.0 → 0.21.0, Feb 20, 2020
@nebi-frame (Author)

Hi @leodido @fntlnz, just wanted to check in. Will the next release include a fix for this bug? And when is the release due?

@ahmed1smael commented Mar 10, 2020

@fntlnz could you update this issue please? Thanks :)

@fntlnz (Contributor) commented Mar 18, 2020

Anyone who wants to give the fix a try can help us by testing the artifacts here: #1099 (comment)

Remember to configure Falco to use the new flag or nothing will change!

@nebi-frame (Author)

Amazing! Thank you so much, we really appreciate your efforts!
