Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gmail: update documentation #917

Merged
merged 7 commits into from
Jan 18, 2025
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .changeset/four-dots-tickle.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
'@openfn/language-gmail': patch
---

Tweak docs
296 changes: 192 additions & 104 deletions packages/gmail/README.md
Original file line number Diff line number Diff line change
@@ -1,103 +1,140 @@
# Gmail Message Content Extraction
# Gmail Adaptor

This adaptor is used to extract specific content from Gmail messages using
custom "desiredContent" configurations. The sample code specifies how to query
Gmail for messages and identify desired attachments and metadata.
## What it does

## How It Works
This adaptor is used to extract specific content from Gmail messages using custom desired "content" configurations. The sample code specifies how to query Gmail for messages and identify desired attachments and metadata.

1. Gmail Query: Constructs a Gmail query string to filter relevant messages.
2. Desired Content Matching: Uses the desiredContents array to identify and
extract:
- Metadata files
- Archive files and their contents
- Message metadata (subject, date, from, body)
3. Output: Returns a structured collection of matched content.
Without any parameters, the `getContentsFromMessages()` function will return an array containing every message in the account of the authenticated user including `from`, `date` and `subject`.

## Usage
A number of options are available to isolated the desired messages and to customize the output.

The adaptor's primary function is `getContentsFromMessages`
## Options

```js
getContentsFromMessages(userId, query, desiredContents, callback);
```
Optional parameters include: `contents`, `query`, `email`, `processedIds`

1. Set userId with the Gmail account to query.
2. Customize the query to contain filters as needed. This is the same format a
the query in the Gmail UI.
3. Specify what content to retrieve from messages (body, subject, attachments,
etc)
### options.contents

## Extracting Message Contents
Use the `options.contents` array to specify the content to retrieve from each message. Always included are `from`, `date`, and `subject`.

The `desiredContents` array should list what content to retrieve from each
message. Each item should be a string (ie, `"body"` or an object describing an
attachment)
Each item can be a simple string (ie, `'body'`, `'subject'`) or an MessageContent object offering advanced configuration.

### Metadata
#### Basic metadata

The following strings can be extracted:
The following types of content can be extracted:

- `body`: Extracts the message body.
- `subject`: Extracts the email subject.
- `date`: Extracts the timestamp of the email.
- `from`: Extracts the sender's information.

### Attachment: basic file
Optionally, each of these content strings can be expanded to include additional specifications:

```js
const mySubject = {
type: 'subject',
name: 'email-title',
maxLength: 25,
};
```

- The `type` property instructs the function which content type to extract.
- The `name` property allows you to add a custom name to this information.
- The `maxLength` property allows you to limit the length of the content returned.

#### Attachment: basic file

Extract content from a file attachment.

Extract the content from a file attachment. Specify the file name with a regular
expression on the `file` key.
`file`: Identify the specific file inside the archive by providing its name as a string or using a regular expression to matching a pattern.

```js
{
type: "file",
name: "metadata",
const myMetadata = {
type: 'file',
name: 'metadata',
file: /^summary\.txt$/,
}
};
```

```js
const myMetadata = {
type: 'file',
file: 'summary.txt',
maxLength: 500,
};
```

### Attachment: archived file
#### Attachment: archived file

Extract the content from a file embedded in an archive attachment.
Extract content from a file embedded in an archive attachment.

Specify the archive with a regular expression on the `archive` key. Extract a
file within the archive with the `file` key.
- `archive`: Specify the file name of the archive using either a string for an exact match or a regular expression to match a pattern.
- `file`: Identify the specific file inside the archive by providing its name as a string or using a regular expression to match a pattern.

```js
{
type: "archive",
name: "data",
archive: /_device_data\.zip$/,
const myArchivedFile = {
type: 'archive',
name: 'data',
archive: 'devicedata.zip',
file: /_CURRENT_DATA_\w*?\.json$/,
}
maxLength: 5000,
};
```

## Query Setup
```js
options.contents = [mySubject, 'body', myMetadata, myArchivedFile];
```

The query variable is constructed to filter Gmail messages:
### options.query

- Inbox messages with the subject containing "30DTR Data".
- Messages sent within the last 31 days.
Use a `query` parameter to filter the messages returned.

## Example
The query syntax supports the same query format as the Gmail `search` box.

```js
const userId = 'tester@gmail.com';
```
options.query = 'from:someuser@example.com rfc822msgid:<somemsgid@example.com> is:unread';
```

A full list of supported search operations can be found here: [Refine searches in Gmail](https://support.google.com/mail/answer/7190)

### options.email

Optionally specify the email address used for the Gmail account. This almost always the same email associated with the authenticated user so this parameter is optional.

```
options.email = '<EMAIL>';
```

const querySubject = encodeURIComponent('device data summary');
### options.processedIds

// All messages newer than 30 days ago
const MILLISECONDS_PER_DAY = 1000 * 60 * 60 * 24;
const queryAfterDate = new Date(Date.now() - 30 * MILLISECONDS_PER_DAY)
.toISOString()
.split('T')[0];
In some scenarios, it may be necessary to skip certain messages to prevent the retrieval of duplicate data. Passing an array of messageIds will allow the function to skip these messages if any of the ids are encountered in the returned messages.

const query = `in:inbox subject:${querySubject} after:${queryAfterDate}`;
```
options.processedIds = [
'194e3cf1ca0ccd66',
'283e2df2ca0ecd75',
'572e1af3ca0bcd84',
];
```

## Example jobs

```js
const query = 'in:inbox newer_than:2d';
const contents = ['body'];
getContentsFromMessages({ query, contents });
```

```js
const subject = 'device data summary'.replace(' ', '+');
const query = `in:inbox subject:${subject} newer_than:1m`;

const email = 'special_assigned_delegate@gmail.com';

const metadataFile = {
type: 'file',
name: 'metadata',
file: /^summary\.txt$/,
file: /summary\.txt$/,
maxLength: 500,
};

const dataFile = {
Expand All @@ -107,63 +144,114 @@ const dataFile = {
file: /_CURRENT_DATA_\w*?\.json$/,
};

const desiredContents = [
'body',
'subject',
'date',
'from',
metadataFile,
dataFile,
];
const contents = [metadataFile, dataFile];

getContentsFromMessages(userId, query, desiredContents, state =>
console.log(state.data)
);
getContentsFromMessages({ query, email, contents });
```

## Sample Output
## Sample `state.data` Output

For each matched message, the extracted content is returned in a collection of
content blocks. Here's an example for a single match message:
For each matched message, the extracted content is returned as a message object of content properties. Here's an example `state.data` for a single matched message:

```js
[
{
messageId: '1934c017c1752c01',
contents: [
{
name: 'subject',
value: 'Fwd: FW: Facility Temperature Report (Summary Data)',
},
{
name: 'date',
value: '2024-11-20T23:56:08.000Z',
},
{
name: 'from',
value: 'Friendly Sender <sender@gmail.com>',
},
{
name: 'metadata',
value:
'{\n "appInfo": {\n "isAutomaticTime": true,\n "isTrueTime": true,\n "os": "Android",\n "osVe" }',
path: '004800123457501820383131_20241115T102926Z.json',
},
{
name: 'data',
value:
'{\n "AMOD": "VL45",\n "AMFR": "ICECO",\n "ASER": "BJBC 08 30",\n "ADOP": "2024-04-01",\n "APQS": "E003/XX" }',
path: '004800123457501820383131_CURRENT_DATA_P100DT9H45M46S_20241115T102926Z.json',
},
],
from: 'Friendly Sender <sender@gmail.com>',
date: '2024-11-20T23:56:08.000Z',
subject: 'Fwd: FW: Facility Anomaly Report (Summary Data)',
metadata: {
filename: 'daily_summary.txt',
content: '{ "appInfo": { "isAutomaticTime": true }',
},
data: {
archiveFilename: '0031_device_data.zip',
filename: '0031_CURRENT_DATA_P100DT9H45M46S_20241115T102926Z.json',
content: '{ "AMOD": "VL45", "AMFR": "ICECO" }',
},
},
];
```

Each content block represents a specific piece of information extracted:
Each property on the message object represents a specific piece of information extracted:

- **subject**: Contains the email subject.
- **date**: The timestamp when the email was sent.
- **from**: Sender's email and name.
- **metadata**: Metadata file content, with its file path.
- **data**: Data file content, with its file path.
- **date**: The timestamp when the email was sent.
- **subject**: Contains the email subject.
- **metadata**: Metadata-named file content, with its matched file name.
- **data**: Data-named archive file content, with its matched archive name and file name.

## Acquiring an access token

The Gmail adaptor implicitly uses the Gmail account of the Google account that is used to authenticate the application.

Allowing the Gmail adaptor to access a Gmail account is a multi-step process.

### Create an OAuth 2.0 client ID

Follow the instructions are found here:
https://support.google.com/googleapi/answer/6158849

- Go to [Google Cloud Platform Console](https://console.cloud.google.com/)
- Go to "APIs & Services"
- Click "Credentials"
- Click "Create Credentials"
- Select "OAuth client ID"
- Select "Create OAuth client ID"
- Select Application type "Web application"
- Add a uniquely-identifiable name
- Click "Create"
- On the resulting popup screen, find and click "DOWNLOAD JSON" and save this file to a secure location.

### Use the Postman application to query the OAuth enpoint and retrieve an access token

Initially, you'll need to configure an authentication request using Postman's built-in OAuth 2.0 implementation:

- Open Postman
- Create a new request
- Switch to the "Authorization" tab
- On the left side, select Type OAuth 2.0
- On the right side, scroll down to the "Configure New Token" section
- Fill out the form using information from the downloaded json file from the
previous section
- Token Name: Google Oauth
- Grant Type: Authorization Code
- Auth URL: (found in the json file as auth_url)
- Access Token URL: (found in the json file as token_url)
- Client ID: (found in the json file as client_id)
- Client Secret: (found in the json file as client_secret)
- Scope: https://www.googleapis.com/auth/gmail.readonly
- State: (any random string is fine)
- Client Authentication: Send as Basic Auth header

Once the form is filled out, repeat these steps each hour to retrieve a new
access token:

- Click on "Get New Access Token"
- A browser will open and you'll be asked to authenticate with your Google Account
- Accept the request to allow this OAuth session to access your Google Account.
- In the MANAGE ACCESS TOKENS popup, find and copy the new Access Token
- This access token will be valid for 1 hour.

### Configure OpenFn CLI to find the access token

The Gmail adaptor looks for the access token in the configuration section under `access_token`.

Example configuration using a workflow:

```
"workflow": {
"steps": [
{
"id": "getGmailContent",
"adaptors": [
"gmail"
],
"expression": "path/to/gmail.js",
"configuration": {
"access_token": "(access token acquired from Postman)"
}
}
]
}
```
Loading
Loading