
add complete documentation #1045

Open
calestyo opened this issue Jul 4, 2024 · 11 comments

@calestyo

calestyo commented Jul 4, 2024

Hey.

It would be nice if there was something like a write_to_stdout() function, or if write_to_textfile() would somehow allow writing to stdout.

For debugging, that would seem to be much more helpful. Also, distros may have their own unified means of collecting the .prom files, and thus even in production it may make sense to let a program simply write its metrics to stdout.

Cheers,
Chris.

PS: Using /dev/stdout doesn't work, as write_to_textfile() tries to write its temporary file there.

@calestyo
Author

calestyo commented Jul 6, 2024

After reading the code I found generate_latest(), which seems to do what I want.

So the actual problem seems rather to be that the whole Python module is basically not documented at all.

https://prometheus.github.io/client_python/ counts at best as a quick how-to, but even that doesn't seem to cover the simplest and probably most regularly used tasks (like clearing time series whose labels no longer exist), not to mention general concepts like the registries or proper documentation of all classes, methods and their parameters. :-(
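
For reference, a rough sketch of what I ended up with (just my own usage, assuming the default registry and a made-up example counter): generate_latest() renders a registry in the text exposition format and returns it as bytes, which can then simply be written to stdout.

import sys

import prometheus_client

# a made-up example metric, just so there is something to expose
c = prometheus_client.Counter('my_requests_total', 'Example counter')
c.inc()

# generate_latest() returns the text exposition format as bytes
sys.stdout.buffer.write(prometheus_client.generate_latest(prometheus_client.REGISTRY))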

@calestyo
Author

calestyo commented Jul 6, 2024

Re-dedicating this issue to ask for some more complete documentation ;-)

calestyo changed the title from "write_to_stdout()?" to "add complete documentation" on Jul 6, 2024
@calestyo
Author

Another area that seems completely undocumented is custom collectors - not only the class itself, but also the concepts: when and for what they're used, and how they work.

The example given at https://prometheus.github.io/client_python/collector/custom/ doesn't even work, as prometheus_client.core no longer exists (I guess it was renamed to metrics_core at some point).

There's no mention that custom collectors are automatically called (and do their work in collect()) whenever the exporter is scraped.

This is in contrast to direct instrumentation, where the metrics are updated "on the fly" in the code, regardless of whether the exporter is ever scraped at all.

No word about what "direct instrumentation" even means.

No word about what describe() is used for, which would be handy if one e.g. writes an exporter that is supposed to support both modes:

  • running continuously and being queryable via HTTP
  • one-shot mode, e.g. being usable with something like node-exporter-collector

If one wants to do both, it seems generate_latest() also calls the custom collectors, but with the default registry it does so twice (probably because of auto-describe), which is of course rather undesirable in one-shot mode.

(I had to gather all that information from blogs and third-party websites... all doing things a bit differently... some making collect() a generator with yield, others not.)

Having basically no proper documentation is really a pity, as the Python client actually seems to be quite neat; but if people have to read the whole source or try their luck on Google with questionable results, they probably simply won't use all the nice features in a proper way.
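
In case it helps anyone else landing here: this is roughly the minimal custom collector I pieced together from those third-party sources (so treat it as my assumption of the intended usage, not as official guidance; it uses collect() as a generator with yield):

import prometheus_client
from prometheus_client.metrics_core import GaugeMetricFamily

class MyCollector(prometheus_client.registry.Collector):
    def collect(self):
        # called on every scrape; may return a list of metric families
        # or, as here, be a generator that yields them
        g = GaugeMetricFamily('my_value', 'Example gauge', labels=['name'])
        g.add_metric(['foo'], 42.0)
        yield g

prometheus_client.REGISTRY.register(MyCollector())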

@csmarchbanks
Member

#1021 is another issue focusing on API documentation that would hopefully make things like generate_latest more discoverable. That said, the custom collector documentation would be more than just API documentation and would be nice to have in addition. Perhaps this issue could focus on that?

I hesitate to have an issue for just "complete documentation" since that is a moving target as more features are added.

@calestyo
Author

It's at least difficult IMO to have many small issues for just the single points that seem to be missing from the documentation, because if I added another issue for everything I find presumably missing, I'd spam the issue tracker ;-)

Like the point of yield (as used by the example - though I actually don't understand why that would work if collect() is called only once per scrape) vs. return (as used in the default collectors shipped in the package), which I've already mentioned above.

It would further e.g. be nice to document that:

  • With direct instrumentation - I presume - only single metrics are set atomically. In the sense that if one has:
    gauge_a.set(34)
    gauge_b.set(66)

    and these - via labels (used like primary keys) - somehow relate to each other, a scrape in between may of course see gauge_a already set but gauge_b not yet.
    Perhaps also mention that it's guaranteed that every single metric is either fully set or not - i.e. it cannot happen that e.g. one label is already set, the value too, but a 2nd label is still empty.
  • With a custom collector, OTOH, it seems that one can set multiple different metrics in an atomic way, i.e. any scrape waits until collect() has run through.

Another thing that IMO lacks documentation is what describe() should actually return. As mentioned above, I have a use case where I want to prevent collect() from being run just for auto-describe.
It's also not really clear what happens if there is simply no describe() function and no auto-describe... I mean, will it just fail? Will some functionality be missing, and if so, which?
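
From reading registry.py my current understanding (which may well be wrong) is: if the collector has a describe() method, the registry calls that at registration time, and only if there is none and auto-describe is enabled (as it is for the default registry) does it fall back to calling collect(). So something like the following seems to avoid the extra collect() at registration, at the price of skipping the duplicate-name check:

import prometheus_client
from prometheus_client.metrics_core import GaugeMetricFamily

class MyCollector(prometheus_client.registry.Collector):
    def describe(self):
        # returning an empty list means: don't call collect() just to
        # learn my metric names at registration time
        return []

    def collect(self):
        yield GaugeMetricFamily('my_value', 'Example gauge', value=1.0)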

And there are many more little bits and pieces.

And at least I'm kinda lacking some more concrete best practices on how to design metrics. I know there are a few documents on that, but they're rather abstract IMO.

I had written a post with some more concrete questions here, and another one in the same place, though that one got caught by the spam filter and so far no moderator was willing to approve it.

@calestyo
Author

calestyo commented Jul 11, 2024

What's IMO also confusing are the custom collector metric classes (CounterMetricFamily, etc.).

My understanding was that each object of e.g. CounterMetricFamily is one metric (at least if one goes by the terminology used in the Prometheus documentation), which can have multiple samples for different values of the possible labels.

Here, however, a MetricFamily seems to be what Prometheus calls a metric, and add_metric seems to set the most recent sample of the time series for a metric+labelset combination.

Edit:
For reference:

Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions.

Every time series is uniquely identified by its metric name and optional key-value pairs called labels.

I would interpret this as: e.g. node_filesystem_files is the “metric” (not the metric family)... and every metric may have many time series, which are identified by their label values.

So it feels like *MetricFamily and add_metric() are a bit of a misnomer.


Also, it's unclear why/when one would use __init__()’s value= and when .add_metric(…, value=). I haven't followed the code to the end, but would guess that .add_metric(…, value=) needs to be used whenever one has different labelsets per metric, and the constructor’s value= may be used when there are no labels (and perhaps also when one simply uses the empty value for all labels?).
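
If my guess is right, the two forms would be used roughly like this (again just my assumption from skimming the code, not from any documentation):

from prometheus_client.metrics_core import CounterMetricFamily

# no labels: the value can be passed directly to the constructor,
# which creates the single, label-less sample
c1 = CounterMetricFamily('my_unlabelled_total', 'Counter without labels', value=3)

# with labels: declare the label names in the constructor and add one
# sample per labelset via add_metric()
c2 = CounterMetricFamily('my_labelled_total', 'Counter with labels', labels=['host'])
c2.add_metric(['heisenberg'], 1)
c2.add_metric(['foo'], 2)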

@calestyo
Author

Yet another ;-) ... part that should IMO be documented is the following:

With direct instrumentation the typical usage seems to be:

  • first, create the metric object once at some place
  • then, modify that object with .set(), .inc(), etc.

With custom collectors, one could also do something like the above, e.g. as in:

import prometheus_client
import sys
import os
import time

r = prometheus_client.CollectorRegistry()

class MyCollector(prometheus_client.registry.Collector):
    def __init__(self, registry):
        self.foo = 0
        registry.register(self)
        self.tm = prometheus_client.metrics_core.CounterMetricFamily('my_time','Current time',labels=['host', 'foo'])

    def collect(self):
        print("collect called", file=sys.stderr)
        self.foo += 1
        tm = self.tm

        tm.add_metric([ os.uname()[1] ], value=self.foo)
        if self.foo % 2 == 0:
            tm.add_metric([ "foo" ], value=666)

        return [tm]


MYCOLLECTOR = MyCollector(r)

prometheus_client.start_http_server(8000, registry=r)
time.sleep(100000000)

where I create one CounterMetricFamily in __init__() and always call .add_metric() on that.

But the outcome of that is at best unexpected:

First scrape:

# HELP my_time_total Current time
# TYPE my_time_total counter
my_time_total{host="heisenberg"} 1.0

Second scrape:

# HELP my_time_total Current time
# TYPE my_time_total counter
my_time_total{host="heisenberg"} 1.0
my_time_total{host="heisenberg"} 2.0
my_time_total{host="foo"} 666.0

Third scrape:

# HELP my_time_total Current time
# TYPE my_time_total counter
my_time_total{host="heisenberg"} 1.0
my_time_total{host="heisenberg"} 2.0
my_time_total{host="foo"} 666.0
my_time_total{host="heisenberg"} 3.0

Seems like it just stores all the samples and prints them all... even for the same metric+labelset combinations.

I wasn't even aware that such "re-definitions" are allowed.

Instead, it seems that – unlike with direct instrumentation – one always has to re-create the metric objects, like in:

import prometheus_client
import sys
import os
import time

r = prometheus_client.CollectorRegistry()

class MyCollector(prometheus_client.registry.Collector):
    def __init__(self, registry):
        self.foo = 0
        registry.register(self)

    def collect(self):
        print("collect called", file=sys.stderr)
        self.foo += 1
        tm = prometheus_client.metrics_core.CounterMetricFamily(
            'my_time',
            'Current time',
            labels=['host', 'foo'],
        )

        tm.add_metric([ os.uname()[1] ], value=self.foo)
        if self.foo % 2 == 0:
            tm.add_metric([ "foo" ], value=666)

        return [tm]


MYCOLLECTOR = MyCollector(r)

prometheus_client.start_http_server(8000, registry=r)
time.sleep(100000000)

in order to get reasonable results.

@greatflyingsteve

Occasionally you find some issues complaining about things and nobody from the project says anything, and you can see the requestor just kinda go off on their own for a while and gradually make less and less sense, clearly on a tear about something only they care about. I think this is the first time I've seen someone do exactly the opposite of that, and @calestyo, thank you so much for fleshing out this issue - there is more legitimate discussion of how the hell anyone might actually use this module in this thread than there is in the official documentation.

The docs for this module read like the work of a "clever" person more invested in showing off how clean and clever their solution was than in demonstrating how it might be useful to anyone. The examples are childishly simple, so I suppose it's possible that the framework is actually clean; but they're also so woefully incomplete that it's entirely unclear how anything in the examples results in a recognizable label or metric of any kind. Some of the object methods appear to be created by the example code, but it's really difficult to be sure because nothing is ever explained. Why I would want to create a method by that name is anyone's guess; even the "three-step demo" only registers two values and then describes how they're used incorrectly to show nothing of value.

In other words, the docs are a lot of "look what I can do," when the point of good documentation ought to be "look what you can do." The fact that a user of the framework can casually, in fits and starts over the course of a single week, actually put more useful information into the ticketing system for the project than the project has ever published in its own literal documentation ought to be a point of deep shame. In over twenty-five years' experience in the open source arena, this is some of the worst documentation I have ever seen, to the point that it might actually be an improvement to have none at all.

@calestyo
Author

@greatflyingsteve Please keep in mind that this is an open source project and any efforts (development and documentation) are provided for free.

There's IMO no reason to assume any bad faith on the side of the developers or the writers of the documentation… if you've been in open source for many years, you probably know that – as a developer – it's often actually quite difficult to write good documentation, because one is already an expert in the system and it's not obvious what questions arise from a "user" PoV.

Typically it also doesn't help to encourage people to make further contributions if they're criticized with harsh words (to put it diplomatically) like "deep shame".

@greatflyingsteve

greatflyingsteve commented Feb 20, 2025

@calestyo I understand what you mean, and I appreciate your civility. What I'm trying to get across - besides my obvious frustration, that is - is very specific, though. I write code. It's not very good code, but I write code, or I wouldn't be here. But I also write stories, and forum posts, and tons of other things. The challenge in good writing isn't saying enough, the challenge is not saying too much. From most developers whose docs I've read, your point that "one is already an expert in the system" is spot on: they write at high granularity, and tend to over-describe or dive off into all the cogs and wheels that make their project work. If they're bad at docs (but do write them), the docs have so much detail that they're only half a level removed from reading the code itself.

In the same vein, if I can write a thousand words on some topic in an hour, writing six hundred words on the same topic takes two hours. Opening the floodgates is easy; pulling the verbosity back out is hard. It takes editing. These docs are "bad" in a way that takes effort. They have been edited down to the point of being maddeningly terse. They focus on use cases involving poorly-understood (but incredibly cool) language mechanics, but then say nothing at all about how or why that method works, or even what it does. There are code examples, but intentionally bad and useless examples that have no resemblance to a real use case, barely describe the outcome, and say nothing about why or how this is useful. In fact, most of the examples contain no output, such that it's almost impossible to tell what the example might actually do, or how the visible primitives actually translate to anything in the Exposition Format.

I'm not saying there's bad faith, but there is a specific flavor of incompleteness that feels like there is, clearly written between the lines, a message that "you might figure this out if you're as clever as I am," in the same way that people who craft advanced logic puzzles intentionally provide only the smallest stubs of a clue. The difference, of course, is that people trying to solve logic puzzles are signing up for the difficulty, and find it desirable. If one is instead making tools, additional difficulty is a bug, not a feature; and this is, very decidedly, a tool.

The project I was so frustrated about when I wrote previously is done. The results I wanted turned out to be easier than expected. The library itself is really good, but I struggled harder for having read the docs instead of the code, and I was already familiar with both Prometheus and Python. It would have been easier to use overly-specific docs generated straight out of the docstrings. Unwittingly signing up to solve logic puzzles with intentionally minimized stubs, when one should reasonably expect instead to find clues in plenty, is a frustrating experience.

I am almost frustrated enough to go fork and come back with a PR.

@csmarchbanks
Member

This is an open source project run by volunteers and contributors. It is not helpful to say things like "intentionally bad and useless" as that is certainly not the intention of anyone.

I am almost frustrated enough to go fork and come back with a PR.

If the docs are frustrating to you, please open a PR to improve them! It shouldn't require you to be "frustrated enough" to improve something; we are always open to improving this library, and docs are a great place to start.
