Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release: v1.3.0 #2069

Merged
merged 37 commits into from
Nov 9, 2022
Merged

release: v1.3.0 #2069

merged 37 commits into from
Nov 9, 2022

Conversation

abernix
Copy link
Member

@abernix abernix commented Nov 9, 2022

🚀 Features

Add support for DHAT-based heap profiling (PR #1829)

The dhat-rs crate provides DHAT-style heap profiling. We have added two compile-time features, dhat-heap and dhat-ad-hoc, which leverage this ability.

By @garypen in #1829

Add trace_id in logs to correlate entries from the same request (Issue #1981)

A trace_id is now added to each log line to help correlate log entries to specific requests. The value for this property will be automatically inherited from any enabled distributed tracing headers, such as those listed in our Tracing propagation header documentation (e.g., Jaeger, Zipkin, Datadog, etc.).

In the event that a trace_id was not inherited from a propagated header, the Router will originate a trace_id and also propagate that value to subgraphs to enable tracing in subgraphs.

Here is an example of the trace_id appearing in plain-text log output:

2022-10-21T15:17:45.562553Z ERROR [trace_id=5e6a6bda8d0dca26e5aec14dafa6d96f] apollo_router::services::subgraph_service: fetch_error="hyper::Error(Connect, ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))"
2022-10-21T15:17:45.565768Z ERROR [trace_id=5e6a6bda8d0dca26e5aec14dafa6d96f] apollo_router::query_planner::execution: Fetch error: HTTP fetch failed from 'accounts': HTTP fetch failed from 'accounts': error trying to connect: tcp connect error: Connection refused (os error 111)

And an exmaple of the trace_id appearing in JSON-formatted log output in a similar scenario:

{"timestamp":"2022-10-26T15:39:01.078260Z","level":"ERROR","fetch_error":"hyper::Error(Connect, ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))","target":"apollo_router::services::subgraph_service","filename":"apollo-router/src/services/subgraph_service.rs","line_number":182,"span":{"name":"subgraph"},"spans":[{"trace_id":"5e6a6bda8d0dca26e5aec14dafa6d96f","name":"request"},{"name":"supergraph"},{"name":"execution"},{"name":"parallel"},{"name":"fetch"},{"name":"subgraph"}]}
{"timestamp":"2022-10-26T15:39:01.080259Z","level":"ERROR","message":"Fetch error: HTTP fetch failed from 'accounts': HTTP fetch failed from 'accounts': error trying to connect: tcp connect error: Connection refused (os error 111)","target":"apollo_router::query_planner::execution","filename":"apollo-router/src/query_planner/execution.rs","line_number":188,"span":{"name":"parallel"},"spans":[{"trace_id":"5e6a6bda8d0dca26e5aec14dafa6d96f","name":"request"},{"name":"supergraph"},{"name":"execution"},{"name":"parallel"}]}

By @bnjjj in #1982

Reload configuration when receiving the SIGHUP signal (Issue #35)

The Router will now reload its configuration when receiving the SIGHUP signal. This signal is only supported on *nix platforms,
and only when a configuration file was passed to the Router initially at startup.

By @Geal in #2015

🐛 Fixes

Fix the deduplication logic in deduplication caching (Issue #1984)

Under load, we found it was possible to break the router de-duplication logic and leave orphaned entries in the waiter map. This fixes the de-duplication logic to prevent this from occurring.

By @garypen in #2014

Follow back-off instructions from Studio Uplink (Issue #1494 Issue #1539)

When operating in a Managed Federation configuration and fetching the supergraph from Apollo Uplink, the Router will now react differently depending on the response from Apollo Uplink, rather than retrying incessantly:

  • Not attempt to retry when met with unrecoverable conditions (e.g., a Graph that does not exist).
  • Back-off on retries when the infrastructure asks for a longer retry interval.

By @Geal in #2001

Fix the rhai SDL print function (Issue #2005)

Fixes the print function exposed to rhai which was broken due to a recent change that was made in the way we pass SDL (schema definition language) to plugins.

By @fernando-apollo in #2007

Export router_factory::Endpoint (PR #2007)

We now export the router_factory::Endpoint struct that was inadvertently unexposed. Without access to this struct, it was not possible to implement the web_endpoints trait in plugins.

By @scottdouglas1989 in #2007

Validate default values for input object fields (Issue #1979)

When validating variables, the Router now uses graph-specified default values for object fields, if applicable.

By @Geal in #2003

Address regression when sending gRPC to localhost (Issue #2036)

We again support sending unencrypted gRPC tracing and metrics data to localhost. This follows-up on a regression which occurred in the previous release which addressed a limitation which prevented sending gRPC to TLS-secured endpoints.

Applying a proper fix was complicated by an upstream issue (opentelemetry-rust#908) which incorrectly assumes https in the absence of a more-specific protocol/schema, contrary to the OpenTelmetry specification which indicates otherwise.

The Router will now detect and work-around this upstream issue by explicitly setting the full, correct endpoint URLs when not specified in config.

In addition:

  • Basic TLS-encyrption will be enabled when the endpoint scheme is explicitly https.
  • A warning will be emitted if the endpoint port is 443 but no TLS config is specified since most traffic on port 443 is expected to be encrypted.

By @bryncooke in https://github.com/apollographql/router/pull/#2048

🛠 Maintenance

Apply Tower best-practice to "inner" Service cloning (PR #2030)

We found our Service readiness checks could be improved by following the Tower project's recommendations for cloning inner Services.

By @garypen in #2030

Split the configuration file implementation into modules (Issue #1790)

The internals of the implementation for the configuration have been modularized to facilitate on-going development. There should be no impact to end-users who are only using YAML to configure their Router.

By @Geal in #1996

Apply traffic-shaping directly to supergraph and subgraph (PR #2034)

The plugin infrastructure works on BoxService instances and makes no guarantee on plugin ordering. The traffic shaping plugin needs a clonable inner service, and should run right before calling the underlying service. We'e changed the traffic plugin application so it can work directly on the underlying service. The configuration remains the same since this is still implemented as a plugin.

By @Geal in #2034

📚 Documentation

Remove references to Git submodules from DEVELOPMENT.md (Issue #2012)

We've removed the instructions from our development documentation which guide users to familiarize themselves with and clone Git submodules when working on the Router source itself. This follows-up on the removal of the modules themselves in PR #1856.

By @garypen in #2045

abernix and others added 30 commits October 25, 2022 15:22
I was running into issues when copying the example from the initial docs
due to function pointers in Rhai can't be referred to imported modules,
so reworked the example per the [rhai
docs](https://rhai.rs/book/language/fn-ptr.html?highlight=fn%20pointer#warning--global-namespace-only).
Recent changes to the storage of the SDL (in an `Arc`) meant it was no longer possible to print from a rhai script. This fixes that problem.
…1982)

close #1981 

It automatically adds a `trace_id` on logs to identify which log is
related to a specific request. Also adds `trace_id` to an example error
to have an example.

This PR also refactor the way we handled formatters before, this
refactor will let us be more flexible on which specific fields we want
to display in logs. For now it keeps the current behavior which do not
display any fields coming from current span and parent spans, only
fields written directly in the log macro.

Example of an error response:
```json
{
  "data": {},
  "errors": [
    {
      "message": "HTTP fetch failed from 'accounts': HTTP fetch failed from 'accounts': error trying to connect: tcp connect error: Connection refused (os error 111)",
      "path": [],
      "extensions": {
        "type": "SubrequestHttpError",
        "service": "accounts",
        "reason": "HTTP fetch failed from 'accounts': error trying to connect: tcp connect error: Connection refused (os error 111)"
      }
    }
  ]
}
```

with response header:
```
apollo_trace_id: 5e6a6bda8d0dca26e5aec14dafa6d96f
```

with related logs:

```logs
2022-10-21T15:17:45.562553Z ERROR [trace_id=5e6a6bda8d0dca26e5aec14dafa6d96f] apollo_router::services::subgraph_service: fetch_error="hyper::Error(Connect, ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" }))"
2022-10-21T15:17:45.565768Z ERROR [trace_id=5e6a6bda8d0dca26e5aec14dafa6d96f] apollo_router::query_planner::execution: Fetch error: HTTP fetch failed from 'accounts': HTTP fetch failed from 'accounts': error trying to connect: tcp connect error: Connection refused (os error 111)
```

Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
…pace

Since this test is typically run together with other tests,
the apollo-router and its dependencies will be already built.
This sharing avoids recompiling some of the dependencies,
but unfortunately not all because of some differences in feature selection.

For local development however, repeated runs of the test will be much faster
since the target directory is preserved across runs.

RUSTFLAGS is replaced with a crate attributes so that it doesn’t affected
dependencies (which then don’t all need to be recompiled)
Fix #1494 
Fix #1539 

The Uplink response must drive the behaviour of the schema downloader.
Some errors indicate we should not retry.
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
Close examination revealed that some judicious tweaking of the locking
and spawning would improve the behaviour.

In performance testing this performs the same as current (dev) or using
the deduplicate crate. I can't reproduce the hanging problem with dev
with this set of changes.

The rationale for the changes are as follows:

- We need to spawn the sentinel **before** we insert the waiter entry or
failure after inserting, but before spawning a sentinel, will cause
issues.
- Notification of waiters from the "first" entry **must be
synchronised** under the "wait_map" lock, or else that becomes racy and
subscribers may be missed. That notification must be immediately
preceded by the removal of the wait_map entry.

fixes: #1984
The last change introduced a bug which meant you can't use the script to
build a docker image from released code. This fixes it.
adding missing escape chars to the code block
adding missing escape chars to the code block
In the past, we didn't have aarch64 binaries. We hacked our install
script so that we downloaded x86_64 binaries and relied on magic (aka
chip emulation) to run the binary.

We build aarch64 binaries now, so we need to update the installer to
install them and give a performance boost to our linux on aarch64 user
base.

I manually tested the change by running a debian docker image on my M1
laptop and using the modified script to install a router into it.
Add support for DHAT heap profiling via dhat features: `dhat-heap` and
`dhat-ad-hoc`.

This isn't intended for "general" consumption, but for router developers
who are trying to understand and resolve potential memory issues.

Co-authored-by: Simon Sapin <simon@apollographql.com>
Co-authored-by: Jesse Rosenberger <git@jro.cc>
The long pole on our build is the arm build/test step. Also, running
everything in parallel is expensive.

Let's address both of those problems:

 - require lint to succeed before starting build/test jobs
 - give more resources to arm jobs to allow them to complete faster
- tweak the build setting for both arm and amd to use resources more
effectively

Result:
- total build time is reduced by ~15% in my test runs (even with waiting
for lint first)
- billing should be reduced by less renovate waste-fullness when lint
fails
Allows router plugins to utilize `web_endpoints` by making the `Endpoint` struct public
Fix #1979

When validating variables, we should use default values for object
fields if applicable
Fix #35 

This adds support for reloading configuration when receiving the SIGHUP
signal. This only works on unix-like platforms, and only with the
configuration file

Co-authored-by: Gary Pennington <gary@apollographql.com>
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| openzipkin/zipkin | patch | `2.23.18` -> `2.23.19` |

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Renovate will not automatically rebase this PR, because
other commits have been found.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://app.renovatebot.com/dashboard#github/apollographql/router).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNC45LjEiLCJ1cGRhdGVkSW5WZXIiOiIzNC45LjEifQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Geoffroy Couprie <geoffroy@apollographql.com>
…2030)

When I was investigating connection resets I noted that we weren't
following tower documentation with respect to [cloning inner
services](https://docs.rs/tower/latest/tower/trait.Service.html#be-careful-when-cloning-inner-services).

This PR addresses that. The change to the supergraph seems to be most
invasive. I have run the tests manually many times and the change seems
good, but probably worth closer examination.
also: fix bracketing in NEXT_CHANGELOG.md

fixes: #2012
There is a bug in otel
open-telemetry/opentelemetry-rust#908 that did
not manifest until `tls-roots` was enabled on `tonic`.

This PR introduces a workaround until this is fixed upstream.

default endpoint is now configured to use http rather than https. In
addition, tls domain will only be set if the endpoint scheme is https.
If the endpoint port is 443 ta warning is displayed if TLS has not been
configured.

Co-authored-by: bryn <bryn@apollographql.com>
SimonSapin and others added 6 commits November 7, 2022 11:16
The plugin infrastructure works on `BoxService` instances, and makes no
guarantee on plugin ordering.
The traffic shaping plugin needs a clonable inner service, and should
run right before calling
the underlying service. So this changes the traffic plugin application
so it can work directly
on the underlying service. It is still a plugin though, so it keeps the
same configuration.
To actively discourage folks from using `--dev` in production, this
explains how to cherry-pick parts of dev mode for a production config.
Signed-off-by: Benjamin Coenen <5719034+bnjjj@users.noreply.github.com>
@abernix abernix added the release label Nov 9, 2022
@abernix abernix mentioned this pull request Nov 9, 2022
@abernix abernix marked this pull request as ready for review November 9, 2022 08:40
@abernix abernix self-assigned this Nov 9, 2022
This PR makes the appropriate changes in preparation for releasing:

1. Bumps versions in preparation for release
2. Does a straight copy of `NEXT_CHANGELOG.md` to top of `CHANGELOG.md`
3. Does initial editorial on `CHANGELOG.md` prior to initial review (the
bulk of the work)

Co-authored-by: Gary Pennington <gary@apollographql.com>
Copy link
Contributor

@garypen garypen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nice thing about reviewing like this is seeing exactly how much work we are putting into each release!

@abernix abernix merged commit bd9035f into main Nov 9, 2022
@abernix abernix deleted the 1.3.0 branch November 9, 2022 12:08
abernix added a commit that referenced this pull request Nov 9, 2022
Following up on:

- #2069 

By merging `main` back into `dev`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.