feat(monitoring): add dns reporter #1376

Merged
merged 7 commits into from
Jun 27, 2024

Conversation

@kishore03109 kishore03109 commented May 15, 2024

Problem

This is the first PR in a series that adds some level of sane reporting.
While scheduling is part of this feature, it is outside the scope of this PR. This PR only adds logic (currently dead code) to grab the domains that we own in Isomer and run a DNS dig on each. The output is meant to be verbose, and alarms can be added later based on the results.

This is not meant to replace monitoring. It is meant to cover some blind spots that Uptime Robot currently has, and to act as a sanity check during incident response by showing the history of DNS records for a site that we manage.

I am opting to log it directly in our backend to keep things simple. I will add alarms and the scheduler in subsequent PRs.

Solution

grab ALL domains from keycdn + amplify + redirection records + log dns records on them.
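
A minimal sketch of the aggregation step described above, assuming each source has already been fetched into a plain list of hostnames. The `DomainSource` labels and the `mergeDomains` helper are hypothetical names used for illustration; only the `domain` and `type` fields mirror what the PR's `IsomerHostedDomain` appears to carry.

```typescript
// Hypothetical source labels; the real values used in the PR may differ.
type DomainSource = "amplify" | "redirection" | "keycdn"

interface HostedDomain {
  domain: string
  type: DomainSource
}

// Tag every hostname with its source and drop exact duplicates
// (same hostname from the same source), keeping the first occurrence.
function mergeDomains(sources: Record<DomainSource, string[]>): HostedDomain[] {
  const seen = new Set<string>()
  const merged: HostedDomain[] = []
  for (const type of Object.keys(sources) as DomainSource[]) {
    for (const raw of sources[type]) {
      const domain = raw.trim().toLowerCase()
      const key = `${type}:${domain}`
      if (domain === "" || seen.has(key)) continue
      seen.add(key)
      merged.push({ domain, type })
    }
  }
  return merged
}

const merged = mergeDomains({
  amplify: ["www.example.gov.sg", "example.gov.sg"],
  redirection: ["example.gov.sg"],
  keycdn: ["cdn.example.gov.sg", "CDN.example.gov.sg"],
})
```

The same hostname is deliberately kept when it appears under two different sources, since the report is meant to be verbose about everything that is hosted.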

Breaking Changes

  • Yes - this PR contains breaking changes
    • Details ...
  • No - this PR is backwards compatible with ALL of the following feature flags in this doc

Tests

in server.ts add:
monitoringService.driver()

should see this in the logs:

Screenshot 2024-05-15 at 5.48.05 PM.png

Deploy Notes

New environment variables:

  • KEYCDN_API_KEY : to get all the zones that we own in keycdn
  • REDIRECTION_REPO_GITHUB_TOKEN: gh token to view redir repo
    • HAVE NOT added env var to 1PW + SSM script

(fetch_ssm_parameters.sh)

@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch 4 times, most recently from 48fd2c3 to 071ff12 Compare May 15, 2024 10:25
@kishore03109 kishore03109 marked this pull request as ready for review May 15, 2024 10:27
@kishore03109 kishore03109 requested a review from a team May 15, 2024 10:27
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from 071ff12 to 9df257d Compare May 17, 2024 05:57
@seaerchin seaerchin left a comment

i think the structure here suffers a bit from mixing concerns of different levels. what could be better is if these were split up so that it's clear what each function is focusing on.

}

interface KeyCdnResponse {
data: {
Contributor

nit: this data key seems a bit extra? maybe the method returning this should just prune data so that callers downstream don't have to prune it themselves

interface RedirectionDomain {
  source: string
  target: string
}
Contributor

not a big deal but we do already have something similar (DNSRecord) in siteInfo

Contributor Author

hmm this one maps directly to what our csv format is in our github directory, so if we add target, papaparse throws an error

const keyCdnApiKey = config.get("keyCdn.apiKey")

return ResultAsync.fromPromise(
fetch(`https://api.keycdn.com/zonealiases.json`, {
Contributor

why use fetch over axios? take note that fetch is experimental and the associated status page recommends not using experimental APIs in production applications

Contributor Author

sure, swapped to using axios

]).andThen(([amplifyDeployments, redirectionDomains, keyCdnDomains]) => {
this.monitoringServiceLogger.info("Fetched all domains")
return okAsync(
[...amplifyDeployments, ...redirectionDomains, ...keyCdnDomains].sort(
Contributor

this whole chunk is pretty confusing to read as there's a lot of repetition and magic numbers being used; wdyt about using sortBy and just stripping www. from both domainA and domainB

Contributor Author

hmm this one might want to push back.

the point of this was to remove blind spots in our monitoring, which means we want to see all the domains that we host, including both the www and the non-www variants. This is just a utility function to group the www subdomain and the apex domain together, and sort the rest alphabetically.

@seaerchin seaerchin May 21, 2024

i think this might be a miscommunication - since our comparison is always based on the stripped value (ie, www.isomer.gov.sg is taken as equivalent to isomer.gov.sg), why not just use sortBy(array, val => val.startsWith("www.") ? val.slice(4) : val)?

in this case, our comparator compares via the stripped value, so www.isomer.gov.sg will be treated identically to isomer.gov.sg. lmk if this makes sense
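
A minimal sketch of the suggestion above. To stay self-contained it uses a local `sortByKey` helper (a hypothetical name) in place of lodash's `sortBy`, which accepts the same kind of iteratee; the sample domains are illustrative only.

```typescript
// Strip a leading "www." so the apex and its www variant share one sort key.
const stripWww = (domain: string): string =>
  domain.startsWith("www.") ? domain.slice(4) : domain

// Local stand-in for lodash's sortBy: sorts a copy by a derived string key.
// Array.prototype.sort is stable, so entries with equal keys keep input order.
function sortByKey<T>(items: T[], key: (item: T) => string): T[] {
  return [...items].sort((a, b) => {
    const ka = key(a)
    const kb = key(b)
    return ka < kb ? -1 : ka > kb ? 1 : 0
  })
}

const sorted = sortByKey(
  ["www.isomer.gov.sg", "abc.gov.sg", "isomer.gov.sg", "www.abc.gov.sg"],
  stripWww
)
// sorted: ["abc.gov.sg", "www.abc.gov.sg", "www.isomer.gov.sg", "isomer.gov.sg"]
```

Both variants of a domain stay in the output and end up adjacent; only the sort key is stripped, not the value itself.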

image

Contributor Author

oo this is neater code, thanks

Comment on lines +269 to +221
.map((value) =>
  reportCard.push({
    ...value,
  })
)
.andThen(() => okAsync(reportCard))
Contributor

not sure if i'm reading this correctly but if we're doing a map that's just a push, what's stopping us from just returning this directly?

Contributor Author

hmm concretely what are you suggesting ah? if we just return the array directly, we will get an empty array, since we return before the promise resolves?

Contributor

Hmm, if we just return without the map, it'll be in a ResultAsync which we can ask downstream to consume isn't it? not sure if i'm misunderstanding you here

Contributor Author

wait ah value is of type ReportCard, but what we want is a list.

The distinction to make here is that ReportCard is for one domain, and we want to map over every domain to get this.
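
A minimal sketch of the "one `ReportCard` per domain, returned as a list" idea from this thread, written with plain promises rather than neverthrow to stay self-contained. The `ReportCard` fields, `ResolveOne`, and `fakeResolve` are assumptions for illustration, not the PR's actual types.

```typescript
// Assumed shape: one card per domain.
interface ReportCard {
  domain: string
  records: string[]
}

// Hypothetical per-domain resolver; a real one would query DNS.
type ResolveOne = (domain: string) => Promise<ReportCard>

// Returns one ReportCard per domain, in input order, without pushing into a
// shared outer array. Promise.all fulfils once every lookup has fulfilled
// and rejects on the first failure.
function generateReportCards(
  domains: string[],
  resolveOne: ResolveOne
): Promise<ReportCard[]> {
  return Promise.all(domains.map((domain) => resolveOne(domain)))
}

// Fake resolver so the sketch runs without network access.
const fakeResolve: ResolveOne = async (domain) => ({
  domain,
  records: [`203.0.113.${domain.length}`],
})

const pendingCards = generateReportCards(
  ["isomer.gov.sg", "www.isomer.gov.sg"],
  fakeResolve
)
```

Because the list only exists once every lookup has finished, there is no window in which a caller can observe a half-filled array.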

.andThen(() => okAsync(reportCard))
})

return ResultAsync.combineWithAllErrors(domainResolvers)
Contributor

actually inspecting the signature for domainResolvers, the error type is never.

this is because we've discarded the error earlier (.orElse(() => okAsync([]))). this seems to suggest that either

  1. this won't error out or that
  2. all errors are equivalent to an empty response?

wondering if this assumption is true and if it's not, whether there's value in actually logging the error rather than just giving up and returning empty arrays

Contributor Author

we go with the second assumption. We are later intending to map this over /siteup for correctness; this is just a form of reporting for some level of sanity check that we have x number of domains under our control and that they resolve to these dns values

Contributor

not sure i'm understanding you correctly here! if the intent is as a form of reporting, i think preserving the errors and formatting them -> displaying has value. discarding the errors seems to be counter to the intent of reporting?

@kishore03109 kishore03109 May 21, 2024

hmm we are not discarding any errors here. we are simply reporting that we could not resolve to anything by returning an empty array. this is like a dns dig on the terminal and getting that it resolves to nothing mah

Contributor

could i just check how this feature is meant to be used? in my head, the flow is something like this:

  • user queries /siteup <domain>
  • we hit the backend here
  • we query for all the records
  • if any error
    • discard error and return []
  • if no errors
    • return the resolved records?

i think it's also worth noting here that the code structure adopted here doesn't quite fit in with neverthrow - we're meant to recover and fix errors (you can observe this through the example for combineWithAllErrors here). when any result is an error, this method actually discards the successful results and returns only the errors.

i don't think there's any actionable wrt updating this to be idiomatic neverthrow but it's something we should consider
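
A minimal sketch of the combine semantics described in this comment. The `Result` type and the `combineWithAllErrors` function below are local stand-ins written for illustration, not neverthrow's implementation; they only mirror the behaviour described above (all values when everything succeeds, otherwise every error and none of the values).

```typescript
// Minimal stand-in for a Result type; this is not neverthrow's own type.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E }

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value })
const err = <E>(error: E): Result<never, E> => ({ ok: false, error })

// If every input is ok, return all values; otherwise return every error
// and drop the successful values.
function combineWithAllErrors<T, E>(
  results: Result<T, E>[]
): Result<T[], E[]> {
  const values: T[] = []
  const errors: E[] = []
  for (const r of results) {
    if (r.ok) values.push(r.value)
    else errors.push(r.error)
  }
  return errors.length > 0 ? err(errors) : ok(values)
}

const allOk = combineWithAllErrors<string, string>([
  ok("1.2.3.4"),
  ok("5.6.7.8"),
])
const mixed = combineWithAllErrors<string, string>([
  ok("1.2.3.4"),
  err("ENODATA"),
  err("ETIMEOUT"),
])
```

In the `mixed` case the successful `"1.2.3.4"` lookup is not available to the caller, which is why the error type being `never` upstream matters for how this call behaves.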

Contributor Author

  • user queries /siteup <domain>
  • we hit the backend here
  • we query for all the records
  • if any error
    • discard error and return []
  • if no errors
    • return the resolved records?

nope, siteup functionality is to remain as is. once we get a list of domains, the idea is to map through the list of domains via the site up so that every morning, we know if anything is not as what we expect it to be.

all this does is get a list of domains -> then dig on them. note that figuring out whether the values returned from the dig are errors is not in this pr as of yet. again, the reason why we are ignoring this is because node:dns throws an error if a lookup resolves to nothing, which does not mean that it is an error. it is valid, and expected, for eg isomer.gov.sg to not resolve anything for CAA results.
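
A minimal sketch of the distinction being discussed, assuming lookups surface Node-style DNS error codes. `toOutcome`, `LookupOutcome`, and the `failure` field are hypothetical names; `ENODATA` and `ETIMEOUT` are error codes defined by Node's `dns` module. It treats an empty answer as an empty record list while still keeping other failures visible in the report.

```typescript
// Error code Node's dns module uses when the server answers with no data.
const EMPTY_ANSWER_CODES = new Set(["ENODATA"])

interface LookupOutcome {
  records: string[]
  // Populated only for failures that are not simply an empty answer.
  failure?: string
}

// Turn a settled lookup into a reportable outcome: an empty answer becomes
// an empty record list, while any other error code is kept for the report.
function toOutcome(
  settled: { records: string[] } | { errorCode: string }
): LookupOutcome {
  if ("records" in settled) return { records: settled.records }
  if (EMPTY_ANSWER_CODES.has(settled.errorCode)) return { records: [] }
  return { records: [], failure: settled.errorCode }
}

const noCaa = toOutcome({ errorCode: "ENODATA" })
const timedOut = toOutcome({ errorCode: "ETIMEOUT" })
```

Under this sketch a domain with no CAA record and a domain whose lookup timed out both report zero records, but only the second carries a `failure` that a later alarm could act on.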

src/server.ts Outdated
Comment on lines 366 to 371
const monitoringService = new MonitoringService({
  launchesService,
})

Contributor

could i check why's this initialised here and not the support container?

Contributor Author

hmm that would require creating launches service on the support container. no strong opinions on this, since this is a pretty light operation anyways

Contributor

to probe - when would this latency/cost matter? the support container is not user facing so latency doesn't really matter as opposed to creating it on the app container. (not that it's a big cost. i think the separation between user-facing and internal is more important here and what we want to maintain)

@kishore03109 kishore03109 May 21, 2024

not that it's a big cost. i think the separation between user-facing and internal is more important here and what we want to maintain)

so i am assuming we dont care about singleton pattern in our codebase ya?
since in this case, launchesService will be created twice?

Contributor

i think this might be due to a mistake in rebase! all shared services should be exported from "common/".

Contributor Author

ok, moved already

generateReportCard(domains: IsomerHostedDomain[]) {
const reportCard: ReportCard[] = []

const domainResolvers = domains.map(({ domain, type }) => {
Contributor

actually this function appears to be doing domain resolution and reporting the resolved records.

should we be splitting the method up so it's clearer?

Contributor Author

hmm what are you concretely suggesting ah? the reporting is just logging, so the function when separated becomes just a one-liner?

Contributor

yeap, i think in my head, the flow is something like:

generateReportCard = (domains) => {
  return generateARecords().andThen(generateQuadA).andThen(generateCaa).andThen(generateCname).orElse(someError)
}

but this would potentially be expensive to do. ok w/ just shifting the domains out:

return generateDomains()
  .andThen(() => {
    this.monitoringServiceLogger.info({
      message: "Report card generated",
      meta: {
        reportCard,
        date: new Date(),
      },
    })
    return okAsync(reportCard)
  })
  .orElse(() => okAsync([]))

Contributor Author

shifted

)
)
.andThen((data: unknown) => {
if (!isKeyCdnResponse(data)) {
Contributor Author

we need the data here since we want to type cast directly from an unknown
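
A minimal sketch of how a type guard over `unknown` can coexist with the earlier suggestion to prune `data` before returning to callers. The `zonealiases` and `name` fields are an assumed response shape (only the outer `data` key appears in the PR snippet), and `extractDomains` is a hypothetical helper.

```typescript
// Assumed response shape; only the outer `data` key comes from the PR snippet.
interface KeyCdnZoneAlias {
  name: string
}
interface KeyCdnResponse {
  data: { zonealiases: KeyCdnZoneAlias[] }
}

const isRecord = (value: unknown): value is Record<string, unknown> =>
  typeof value === "object" && value !== null

// Narrow an unknown JSON payload before touching any of its fields.
function isKeyCdnResponse(value: unknown): value is KeyCdnResponse {
  if (!isRecord(value)) return false
  const data = value.data
  if (!isRecord(data)) return false
  const aliases = data.zonealiases
  return (
    Array.isArray(aliases) &&
    aliases.every((a) => isRecord(a) && typeof a.name === "string")
  )
}

// Unwrap `data` once so callers only ever see the list of domain names.
function extractDomains(payload: unknown): string[] {
  return isKeyCdnResponse(payload)
    ? payload.data.zonealiases.map((alias) => alias.name)
    : []
}

const samplePayload: unknown = {
  data: { zonealiases: [{ name: "cdn.example.gov.sg" }] },
}
const keyCdnDomains = extractDomains(samplePayload)
```

The guard still needs to know about `data` in order to validate the raw payload, but nothing downstream of `extractDomains` does.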

@kishore03109 kishore03109 requested a review from seaerchin May 21, 2024 07:54
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from fb7233d to d1fa7ea Compare May 24, 2024 02:15
@seaerchin seaerchin left a comment

forgot to submit as review (sorry) - i think i'm not entirely sure how the errors will play out so i'm holding off on approval. if i could understand how you'd intend this code to function, i think it might be better

@seaerchin seaerchin left a comment

potential breakage with the plugin - should either update tsconfig or drop

// seems to be a bug in typing, this is a direct
// copy paste from the octokit documentation
// https://octokit.github.io/rest.js/v20#automatic-retries
const OctokitRetry = Octokit.plugin(retry as any)
Contributor

not sure if this is required! saw your comment but going to the linked site doesn't seem to suggest this.

image

this might be due to our tsconfig not changing, which seems to be required by the retry library
image

i think we might want to update our tsconfig and check again

Contributor Author

sadly i get this even after changing
Screenshot 2024-06-06 at 3 54 25 PM
Screenshot 2024-06-06 at 3 55 10 PM

Contributor Author

hey @seaerchin, this has since been updated by increasing the version number, thanks for the catch

@@ -112,6 +112,7 @@
"valueFrom": "STAGING_INCOMING_QUEUE_URL"
},
{ "name": "JWT_SECRET", "valueFrom": "STAGING_JWT_SECRET" },
{ "name": "KEYCDN_API_KEY", "valueFrom": "STAGING_KEYCDN_API_KEY" },
Contributor

shouldn't this just be required on support containers?

Contributor Author

moved

getAllDomains = () =>
ResultAsync.fromPromise(
this.launchesRepository.findAll(),
Contributor

are there domains outside of the launches table?

Contributor Author

yes, which is why we also fetch domains from keycdn

@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from d1fa7ea to 381ac3b Compare June 6, 2024 07:04
@kishore03109 kishore03109 requested review from a team, harishv7 and seaerchin June 6, 2024 07:04
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from eb096b3 to 61553f6 Compare June 18, 2024 02:34
@@ -98,6 +98,7 @@
"valueFrom": "PROD_ISOMERPAGES_REPO_PAGE_COUNT"
},
{ "name": "JWT_SECRET", "valueFrom": "PROD_JWT_SECRET" },
{ "name": "KEYCDN_API_KEY", "valueFrom": "PROD_KEYCDN_API_KEY" },
Contributor

remember to add either on pulumi or directly to ssm

import LaunchesService from "@root/services/identity/LaunchesService"
import promisifyPapaParse from "@root/utils/papa-parse"

interface MonitoringServiceInterface {
Contributor

nit: MonitoringServiceProps

@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from 8f6e62d to e16de47 Compare June 25, 2024 05:58
@kishore03109 kishore03109 force-pushed the 05-15-feat_monitoring_add_dns_reporter branch from e16de47 to 98001ce Compare June 27, 2024 01:37
@kishore03109 kishore03109 merged commit 8c47822 into develop Jun 27, 2024
11 of 12 checks passed
@kishore03109 kishore03109 deleted the 05-15-feat_monitoring_add_dns_reporter branch June 27, 2024 06:08
This was referenced Jun 27, 2024
@dcshzj dcshzj mentioned this pull request Jun 27, 2024