[amtool] - Add a new silence import command #1082

Merged
merged 2 commits into from
Dec 7, 2017

@lebinh lebinh commented Nov 6, 2017

This command reads silence data from the JSON output of a query and imports it
into Alertmanager. It allows amtool to be used as a backup/restore tool for
silences, i.e. it fixes #1000

Backup / export:

amtool silence -o json > silences.json

Restore / import:

amtool silence import silences.json

This adds silences one by one and will be slow if you have thousands of them, but the Alertmanager API doesn't accept multiple silences in a single POST, so ¯\_(ツ)_/¯

for _, silence := range silences {
// reset the ID, otherwise alertmanager API will try to replace an existing silence
// this *might* create duplicated silences but that should generally be OK
silence.ID = ""
Contributor

I think we want to avoid duplicate silences if possible. Does the Alertmanager API accept posting a new silence with the ID field present?

Contributor Author

It will accept the silence if the ID exists, and will replace the existing silence with the new one. But if the ID doesn't exist, it returns a "not found" error. We could check and only remove the ID after receiving a "not found" error, but that would add a lot of extra round trips in the backup / restore use case (every add would first have to wait for a "not found" error).

Contributor

Yeah, I don't think we should have duplicated silences either. Maybe skip the import if the silence already exists?

Contributor

Since this is for disaster recovery/data migration, I'm assuming maintaining the IDs wouldn't be important (and wouldn't expect there to be any pre-existing silences in the new instance of AM).

Contributor Author

I will add a flag to force this. It is needed because we are running Alertmanager 0.6 and thus cannot update existing silences. I will keep the silence ID by default, as it appears Alertmanager will only update the existing silence without creating a new one in that case.

@lebinh
Contributor Author

lebinh commented Nov 27, 2017

Hi @stuartnelson3, @Kellel can you guys help take a look and see if we can merge this PR?

@stuartnelson3
Contributor

Hey sorry for the long delay, I've been away for the last few weeks and have been behind on reviewing things. I'll take a look at this in the next few days, thanks for your patience!

@lebinh lebinh changed the title Add a new silence import command to amtool [amtool] - Add a new silence import command to amtool Nov 27, 2017
@lebinh
Contributor Author

lebinh commented Nov 27, 2017

Sure @stuartnelson3 no problem, thank you for maintaining this.

@lebinh lebinh changed the title [amtool] - Add a new silence import command to amtool [amtool] - Add a new silence import command Nov 27, 2017
var err error

if len(args) == 1 {
input, err = os.Open(args[0])
Contributor

@josedonizetti josedonizetti Nov 30, 2017

You need to close the opened fd: defer input.Close()

Contributor

This should be added after the err is checked below

}
}

data, err := ioutil.ReadAll(input)
Contributor

ioutil.ReadAll will load the entire input into memory. It would be better to process the information as a stream.

See https://golang.org/pkg/encoding/json/#example_Decoder_Decode_stream

Contributor

Actually it ended up being a super easy change:

dec := json.NewDecoder(input)

// read the opening bracket; the token itself is not needed
_, err := dec.Token()
if err != nil {
	return err
}

// decode and add one silence at a time instead of loading everything into memory
for dec.More() {
	var s types.Silence
	err := dec.Decode(&s)
	if err != nil {
		log.Fatal(err)
	}

	s.ID = ""

	err = addSilence(s)
	if err != nil {
		msg := fmt.Sprintf("couldn't add silence: %s", s)
		return errors.Wrap(err, msg)
	}
}

// read the closing bracket
_, err = dec.Token()
if err != nil {
	return err
}

return nil

// this *might* create duplicated silences but that should generally be OK
silence.ID = ""

err = addSilence(silence)
Contributor

Since this is being done in serial, it will be relatively slow. We can always create a small worker pool (https://gobyexample.com/worker-pools) without much trouble. I'm fine adding it later, but if you want to add it now, feel free.

}

func bulkImport(cmd *cobra.Command, args []string) error {
input := os.Stdin
Contributor

also accepting a json stream on stdin is 👌

@lebinh
Contributor Author

lebinh commented Dec 5, 2017

Sorry for the close / re-open; I pushed the wrong commit 😞
I added a new flag, --force, to remove the silence ID before calling the add API. The default behavior, i.e. without --force, is to remove the ID and retry only if we receive a "not found" error.

There is also a new flag --worker to specify the number of concurrent workers calling Alertmanager API.
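
The default and forced behaviors described above can be sketched roughly like this. Everything here is an assumption made for illustration: `fakeAPI`, `importSilence`, `errNotFound`, and the simplified `Silence` type are hypothetical stand-ins, and the fake keeps the original ID on replacement, which the real Alertmanager API may not do.

```go
package main

import (
	"errors"
	"fmt"
)

// Silence is a simplified stand-in for the real silence type.
type Silence struct {
	ID      string
	Comment string
}

var errNotFound = errors.New("silence not found")

// fakeAPI mimics the behavior described in the thread: an empty ID creates
// a silence, a known ID replaces it, and an unknown ID fails with not found.
type fakeAPI struct {
	store map[string]Silence
	next  int
}

func (a *fakeAPI) post(s Silence) (string, error) {
	if s.ID == "" {
		a.next++
		s.ID = fmt.Sprintf("new-%d", a.next)
	} else if _, ok := a.store[s.ID]; !ok {
		return "", errNotFound
	}
	a.store[s.ID] = s
	return s.ID, nil
}

// importSilence posts with the ID first and retries without it on
// "not found". With force set, it drops the ID up front and skips the
// extra round trip.
func importSilence(api *fakeAPI, s Silence, force bool) (string, error) {
	if force {
		s.ID = ""
		return api.post(s)
	}
	id, err := api.post(s)
	if errors.Is(err, errNotFound) {
		s.ID = ""
		return api.post(s)
	}
	return id, err
}

func main() {
	api := &fakeAPI{store: map[string]Silence{"abc": {ID: "abc", Comment: "old"}}}
	id1, _ := importSilence(api, Silence{ID: "abc", Comment: "restored"}, false)
	id2, _ := importSilence(api, Silence{ID: "gone", Comment: "restored"}, false)
	fmt.Println(id1, id2, len(api.store))
}
```

In the non-forced path a silence whose ID is unknown costs two requests, which is consistent with the reasoning given for why forcing is faster when importing into an empty instance.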

This command reads silence data from the JSON output of a query and imports it
into Alertmanager. It allows `amtool` to be used as a backup/restore tool for
silences, i.e. prometheus#1000

Backup / export:

```
amtool silence -o json > silences.json
```

Restore / import:

```
amtool silence import silences.json
```
@lebinh
Contributor Author

lebinh commented Dec 5, 2017

It's probably worth mentioning that the force flag can also be used to speed up the import when the silences are new:

$ time ./amtool silence import -w 4 silences.json
...
real	0m6.188s
user	0m0.247s
sys	0m0.087s

$ time ./amtool silence import -w 4 -f silences.json
...
real	0m3.549s
user	0m0.218s
sys	0m0.068s

The end results (with or without -f) are the same; the difference in speed comes from not having to wait for the network round trip of a "not found" error and then retry.

@stuartnelson3
Contributor

Awesome work! I'll give this a try tomorrow morning.

Contributor

@stuartnelson3 stuartnelson3 left a comment

Thanks for adding the worker pool! Unfortunately, the way it's currently constructed, it will deadlock if you try to import more than 100 silences. The error channel isn't drained until AFTER all silences have been created, so the send to errc in addSilenceWorker will block.

I've added some code comments inline which fix the problem.

}

silences := make(chan *types.Silence, 100)
errs := make(chan error, 100)
Contributor

can you rename this to errc?

return errors.Wrap(err, "couldn't unmarshal input data, is it JSON?")
}

silences := make(chan *types.Silence, 100)
Contributor

The Go convention for naming channels is typically to name the thing and then put c after it (to indicate a channel). Could you change this to silencec?

}
for w := 0; w < workers; w++ {
go addSilenceWorker(silences, errs)
}
Contributor

These three lines can be updated to help prevent a deadlock:

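var wg sync.WaitGroup
for w := 0; w < workers; w++ {
	// Add before starting the goroutine so wg.Wait() cannot run ahead of it
	wg.Add(1)
	go func(w int) {
		addSilenceWorker(silencec, errc)
		wg.Done()
	}(w)
}

// drain errc concurrently so workers never block on sending errors
go func() {
	for err := range errc {
		if err != nil {
			errCount++
		}
	}
}()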

Check out down below where we wg.Wait(), which indicates that all the silence workers are finished, and then we can close(errc).

silences <- &s
count++
}
close(silences)
Contributor

The final piece is here:

close(silencec)
wg.Wait()
close(errc)

Once we're done sending all the parsed silences, we can close the channel we're ranging over (anything still in the channel waiting to be processed will be processed). From there, we wg.Wait() to know all the silences have been created and their errors sent on the error channel, and then we can close(errc).

if err != nil {
errCount++
}
}
Contributor

This code should be removed, it has been placed into the goroutine further up.

Move error channel reading to a goroutine to prevent deadlock and thus
add a WaitGroup to synchronize.
@lebinh
Contributor Author

lebinh commented Dec 6, 2017

Thanks @stuartnelson3, this feedback is very helpful as I only started with Go recently. I've updated the PR as suggested.

// read closing bracket
_, err = dec.Token()
if err != nil {
return errors.Wrap(err, "invalid JSON")
Contributor Author

Removed this as I think it doesn't really make sense to throw an error over JSON format here, the silences are added already at this point.

Contributor

I would actually prefer to keep it in. Even though all the silences have been added, a user should still be informed if they are attempting to use invalid JSON.
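
To illustrate why the closing-bracket check still tells the user something, here is a small self-contained sketch. The `Silence` type and the `decodeAll` helper are hypothetical; only the `encoding/json` decoder calls are real.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Silence is a simplified stand-in for the real silence type.
type Silence struct {
	Comment string `json:"comment"`
}

// decodeAll streams a JSON array and returns how many elements were decoded.
// Elements are counted as they are read, so a truncated input still yields a
// non-zero count together with an error from the final Token call.
func decodeAll(input string) (int, error) {
	dec := json.NewDecoder(strings.NewReader(input))
	if _, err := dec.Token(); err != nil { // read opening bracket
		return 0, err
	}
	count := 0
	for dec.More() {
		var s Silence
		if err := dec.Decode(&s); err != nil {
			return count, err
		}
		count++
	}
	if _, err := dec.Token(); err != nil { // read closing bracket
		return count, fmt.Errorf("invalid JSON: %w", err)
	}
	return count, nil
}

func main() {
	n, err := decodeAll(`[{"comment":"a"},{"comment":"b"}`)
	fmt.Println(n, err != nil)
}
```

A file cut off before the closing bracket is a plausible sign of an incomplete backup, so reporting it after the import seems consistent with the reviewer's point.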

@stuartnelson3
Contributor

I'll take a look at this tomorrow to confirm it's working, thanks for the work on this

Contributor

@stuartnelson3 stuartnelson3 left a comment

Ok, works awesome! Took 3sec to insert 10,000 silences (and half that with --force)

@stuartnelson3 stuartnelson3 merged commit 9b12714 into prometheus:master Dec 7, 2017
@lebinh
Contributor Author

lebinh commented Dec 8, 2017

Great! Thanks @stuartnelson3

Development

Successfully merging this pull request may close these issues.

Provide silence backup and restore functionality in amtool cli tool
4 participants