Openalex Data Importer #439

Open
wants to merge 23 commits into
base: develop
9be450d
open alex data import script
shadrach-tayo Jul 16, 2024
f4353fd
hide openalex folder
shadrach-tayo Jul 16, 2024
c755261
realtime data import and transform into relational model
shadrach-tayo Jul 29, 2024
7d319df
replace prisma with pg and drizzle orm
shadrach-tayo Jul 30, 2024
aca3fe5
add build and start script (pm2)
shadrach-tayo Jul 30, 2024
6a156c2
integrate pino http logger
shadrach-tayo Jul 30, 2024
9672c4c
Merge branch 'develop' into openalex-importer
shadrach-tayo Jul 30, 2024
775388f
add support for more tables
shadrach-tayo Jul 31, 2024
daa325e
update worksConcepts, mesh and topics from return data
shadrach-tayo Aug 2, 2024
2c2994d
add .env.example, remove script constraint
shadrach-tayo Aug 2, 2024
2265af4
remove prisma links
shadrach-tayo Aug 2, 2024
aa03580
clean up
shadrach-tayo Aug 2, 2024
5ae6192
clean up dockerfile and compose file
shadrach-tayo Aug 2, 2024
3a948f2
chore: clean up prisma client traces
shadrach-tayo Aug 2, 2024
35bd6c1
wip: add time traveling support for import script, update schema and …
shadrach-tayo Sep 6, 2024
0f2d1f4
complete time travel logic
shadrach-tayo Sep 6, 2024
f59e167
add todo
shadrach-tayo Sep 7, 2024
d44c093
Merge branch 'develop' into openalex-importer
shadrach-tayo Sep 16, 2024
fcbd7a5
add k8s config
shadrach-tayo Sep 16, 2024
aff7138
add k8s config, update local docker setup and npm scripts, integrate …
shadrach-tayo Sep 16, 2024
0376584
eslint fix
shadrach-tayo Sep 16, 2024
6528ac9
Merge branch 'develop' into openalex-importer
shadrach-tayo Jan 8, 2025
4b145f5
chore: fix formatting
shadrach-tayo Jan 8, 2025
118 changes: 118 additions & 0 deletions .github/workflows/build-openalex-importer.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
# build.yml
on:
  push:
    paths:
      - .github/workflows/**
      - openalex-importer/**
      - Dockerfile
    branches: # array of glob patterns matching against refs/heads. Optional; defaults to all
      - main # triggers on pushes that contain changes
      - develop

name: Build openalex-importer

# https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
env:
  AWS_DEFAULT_REGION: us-east-2
  AWS_DEFAULT_OUTPUT: json
  AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  CONTAINER_IMAGE: openalex-importer
  DOCKER_BUILDKIT: 1

jobs:
  build-and-push:
    name: Build and deploy
    runs-on: ubuntu-latest
    steps:
      - uses: hashicorp/setup-terraform@v1
      - name: Checkout
        uses: actions/checkout@v4

      # Add steps here like linting, testing, minification, etc.
      - id: install-aws-cli
        uses: unfor19/install-aws-cli-action@v1
        with:
          version: 1

      - uses: prepor/action-aws-iam-authenticator@master
      - run: aws-iam-authenticator version

      - name: Install Kubectl
        run: |
          #$(curl -Ls https://dl.k8s.io/release/stable.txt)
          version=v1.23.6
          echo "using kubectl@$version"
          # -o alone names the output file; combining it with -O is redundant
          curl -sL "https://dl.k8s.io/release/$version/bin/linux/amd64/kubectl" -o kubectl
          chmod +x kubectl
          mv kubectl /usr/local/bin
          mkdir -p "$HOME/.kube"
          sudo apt-get update
          sudo apt-get install -y less
          echo "${{ secrets.KUBE_CONFIG_DATA }}" | base64 --decode > "$HOME/.kube/config"
          aws sts get-caller-identity

      - name: Build and tag the image (DEV)
        if: github.ref == 'refs/heads/develop'
        run: |
          # Build and tag the image
          docker build \
            -t $CONTAINER_IMAGE-dev:latest \
            -t $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE-dev \
            ./openalex-importer

      - name: Build and tag the image (PROD)
        if: github.ref == 'refs/heads/main'
        run: |
          # Build and tag the image
          docker build \
            -t $CONTAINER_IMAGE:latest \
            -t $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE \
            ./openalex-importer

      # Add additional steps here like scanning of image

      # Push to the registry: develop pushes the -dev image, main pushes the production image
      - name: Push (DEV)
        if: github.ref == 'refs/heads/develop'
        run: |
          # Push image to AWS ECR
          aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
          docker tag $CONTAINER_IMAGE-dev:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE-dev:${{ github.sha }}
          docker tag $CONTAINER_IMAGE-dev:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE-dev:latest
          docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE-dev:${{ github.sha }}
          docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE-dev:latest

      - name: Push (PROD)
        if: github.ref == 'refs/heads/main'
        run: |
          # Push image to AWS ECR
          aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
          docker tag $CONTAINER_IMAGE:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE:${{ github.sha }}
          docker tag $CONTAINER_IMAGE:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE:latest
          docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE:${{ github.sha }}
          docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE:latest

      - name: Deploy to EKS (DEV)
        # uses: steebchen/kubectl@v2.0.0
        if: github.ref == 'refs/heads/develop'
        run: | # uses the kubectl binary installed above (v1.23.6)
          kubectl apply -f openalex-importer/kubernetes/deployment.yaml
          kubectl set image deployment/openalex-importer-dev openalex-importer-dev=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE-dev:${{ github.sha }} --record

      - name: Deploy to EKS (PROD)
        if: github.ref == 'refs/heads/main'
        run: | # uses the kubectl binary installed above (v1.23.6)
          kubectl apply -f openalex-importer/kubernetes/deployment.yaml
          kubectl set image deployment/openalex-importer openalex-importer=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$CONTAINER_IMAGE:${{ github.sha }} --record

      - name: Verify EKS Deployment (DEV)
        if: github.ref == 'refs/heads/develop'
        run: |
          kubectl rollout status deployment/openalex-importer-dev

      - name: Verify EKS Deployment (PROD)
        if: github.ref == 'refs/heads/main'
        run: |
          kubectl rollout status deployment/openalex-importer
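
The branch-specific image naming used by the workflow above can be summarized in a short sketch. This is only an illustration of the naming rule, not code from the PR: the function name `ecrImageRef` and its parameters are hypothetical, and the account ID in the usage note is a made-up placeholder.

```typescript
// Hypothetical helper mirroring the image names built by the workflow's shell steps.
// Nothing here is part of the PR; it only restates the naming rule.
function ecrImageRef(accountId: string, region: string, gitRef: string, sha: string): string | undefined {
  const base = 'openalex-importer'; // CONTAINER_IMAGE in the workflow env block
  let repository: string;
  if (gitRef === 'refs/heads/develop') {
    repository = `${base}-dev`; // DEV steps append "-dev"
  } else if (gitRef === 'refs/heads/main') {
    repository = base; // PROD steps use the bare name
  } else {
    return undefined; // no image is built or pushed for other refs
  }
  // Each push is tagged with the commit SHA (and separately with "latest").
  return `${accountId}.dkr.ecr.${region}.amazonaws.com/${repository}:${sha}`;
}
```

For example, `ecrImageRef('123456789012', 'us-east-2', 'refs/heads/develop', 'abc123')` yields the `openalex-importer-dev` repository tagged with `abc123`, which is the reference the DEV deploy step passes to `kubectl set image`.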
28 changes: 0 additions & 28 deletions .github/workflows/build-repo-server.yaml
@@ -25,34 +25,6 @@ env:
  DOCKER_BUILDKIT: 1

jobs:
  # build-and-test:
  #   runs-on: ubuntu-latest
  #   # container:
  #   #   image: ubuntu:latest
  #   #   options: --user 1001

  #   steps:
  #     - name: Check out repository
  #       uses: actions/checkout@v2

  #     - name: Set up the environment
  #       # Replace this line with the appropriate setup for your project
  #       # Examples:
  #       uses: actions/setup-node@v2
  #       with:
  #         node-version: 16

  #     - name: Install dependencies
  #       run: cd desci-repo && yarn

  #     - name: Run tests
  #       run: |
  #         cd desci-repo && export DOCKER_BUILDKIT=1 && yarn && yarn test
  #         echo "exit code $?"
  #         if [ $? -ne 0 ]; then
  #           exit 1
  #         fi

  build-and-push:
    # needs: build-and-test
    name: Build and deploy
1 change: 0 additions & 1 deletion .gitignore
@@ -17,4 +17,3 @@ local-data/
.idea
**/*.iml
.composedbRuntimeDefinition.json
openalex-importer
5 changes: 5 additions & 0 deletions openalex-importer/.dockerignore
@@ -0,0 +1,5 @@
config
dist
node_modules
.git
.env
9 changes: 9 additions & 0 deletions openalex-importer/.env.example
@@ -0,0 +1,9 @@
NODE_ENV=development
OPENALEX_API_KEY=

PG_HOST=host.docker.internal
PG_PORT=5438
POSTGRES_USER=
POSTGRES_PASSWORD=
POSTGRES_DB=
DATABASE_URL=
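
How these variables relate to one another is not spelled out in the example file. The sketch below shows one plausible way to derive a connection string from them, assuming an explicit `DATABASE_URL` takes precedence over the individual parts. The function `buildDatabaseUrl` is hypothetical and is not taken from the importer's source.

```typescript
// Hypothetical helper: one way the variables in .env.example could be combined.
// The importer's actual configuration code is not shown in this diff.
type Env = Record<string, string | undefined>;

function buildDatabaseUrl(env: Env): string {
  // Assumption: an explicit DATABASE_URL wins over the individual parts.
  const explicit = env['DATABASE_URL'];
  if (explicit) {
    return explicit;
  }
  const required = ['PG_HOST', 'PG_PORT', 'POSTGRES_USER', 'POSTGRES_PASSWORD', 'POSTGRES_DB'] as const;
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  // Percent-encode credentials so characters like "@" or ":" do not break the URL.
  const user = encodeURIComponent(env['POSTGRES_USER'] as string);
  const password = encodeURIComponent(env['POSTGRES_PASSWORD'] as string);
  return `postgresql://${user}:${password}@${env['PG_HOST']}:${env['PG_PORT']}/${env['POSTGRES_DB']}`;
}
```

In a Node.js process this could be called as `buildDatabaseUrl(process.env)`. With the defaults above, `PG_HOST=host.docker.internal` and `PG_PORT=5438` point at the Postgres port published by the local Docker Compose setup.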
45 changes: 45 additions & 0 deletions openalex-importer/.eslintrc.cjs
@@ -0,0 +1,45 @@
module.exports = {
  parser: '@typescript-eslint/parser',

  extends: ['plugin:@typescript-eslint/recommended', 'prettier', 'plugin:prettier/recommended'],

  plugins: ['@typescript-eslint', 'import', 'simple-import-sort'],

  rules: {
    // General
    '@typescript-eslint/no-unused-vars': 0,
    '@typescript-eslint/explicit-module-boundary-types': 0,
    '@typescript-eslint/no-explicit-any': 0,
    '@typescript-eslint/no-non-null-assertion': 0,
    '@typescript-eslint/ban-ts-comment': 0,
    '@typescript-eslint/no-empty-interface': 0,
    'simple-import-sort/imports': 'error',

    // Import
    'import/order': [
      'error',
      {
        groups: ['builtin', 'external', 'internal', 'parent', 'sibling'],
        'newlines-between': 'always',
        alphabetize: {
          order: 'asc',
          caseInsensitive: true,
        },
      },
    ],
  },

  parserOptions: {
    ecmaVersion: 2018,
    sourceType: 'module',
  },

  settings: {
    'import/resolver': {
      node: {
        extensions: ['.js', '.jsx', '.ts', '.tsx'],
        moduleDirectory: ['node_modules', 'src/'],
      },
    },
  },
};
20 changes: 20 additions & 0 deletions openalex-importer/.gitignore
@@ -0,0 +1,20 @@
node_modules
# Keep environment variables out of version control
.env
database/boilerplate/
database/test/
dist
.nebulus
log/access.log

temp_downloads
.nyc_output
coverage

server.log

repo-tmp

postgres-data/

logs
1 change: 1 addition & 0 deletions openalex-importer/.prettierignore
@@ -0,0 +1 @@
dist
7 changes: 7 additions & 0 deletions openalex-importer/.prettierrc
@@ -0,0 +1,7 @@
{
  "printWidth": 120,
  "semi": true,
  "singleQuote": true,
  "trailingComma": "all"
}

15 changes: 15 additions & 0 deletions openalex-importer/Dockerfile
@@ -0,0 +1,15 @@
FROM node:20.8.1-bullseye-slim

VOLUME /root/.yarn

RUN npm install -g npm@10.8.2

RUN mkdir /app
WORKDIR /app

COPY . .

RUN --mount=type=cache,target=/root/.yarn YARN_CACHE_FOLDER=/root/.yarn yarn install
RUN yarn build

CMD [ "yarn", "start" ]
79 changes: 79 additions & 0 deletions openalex-importer/README.MD
@@ -0,0 +1,79 @@
# OpenAlex Data Importer Script
This script helps keep our OpenAlex data imports up to date in real time.

## Table of Contents
- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
  - [Running Locally](#running-locally)
  - [Running in Production](#running-in-production)
  - [Script Arguments](#script-arguments)
- [Common Commands](#common-commands)
  - [Introspect Remote OpenAlex Schema](#introspect-remote-openalex-schema)
  - [Generate Batches Migration](#generate-batches-migration)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)

## Introduction
[Brief explanation of OpenAlex and the purpose of this importer]

## Prerequisites
- Node.js (version X.X or higher)
- Docker
- [Any other requirements]

## Installation
[Steps to install the project]
```bash
git clone <github-url>
cd openalex-importer
yarn install
yarn build
yarn start
```

## Usage

### Running Locally
1. Install dependencies
2. Start Docker service
3. Introspect schema

### Running in Production
1. Install dependencies
2. Start Docker service to run script

### Script Arguments
Run the script using:
```bash
node ./index.js --start=08-24-2024 --end=09-05-2024
```
Note: Both arguments are optional.
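
As an illustration of how the `--start` and `--end` flags could be read, here is a minimal sketch. It assumes the dates are in `MM-DD-YYYY` form (inferred from the example above) and are interpreted as UTC. The names `parseDateArg` and `parseDateRange` are hypothetical, and the importer's actual parsing may differ.

```typescript
// Hypothetical helpers: the importer's real argument parsing is not shown in this README.
interface DateRange {
  start?: Date;
  end?: Date;
}

// Parse a single MM-DD-YYYY value (format inferred from the example command) as a UTC date.
function parseDateArg(value: string): Date {
  const match = /^(\d{2})-(\d{2})-(\d{4})$/.exec(value);
  if (!match) {
    throw new Error(`Expected MM-DD-YYYY, got "${value}"`);
  }
  const month = Number(match[1]);
  const day = Number(match[2]);
  const year = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Reject impossible dates such as 02-31-2024, which Date would silently roll over.
  if (date.getUTCFullYear() !== year || date.getUTCMonth() !== month - 1 || date.getUTCDate() !== day) {
    throw new Error(`Invalid calendar date: "${value}"`);
  }
  return date;
}

// Read the optional --start=... and --end=... flags from an argv-style array.
function parseDateRange(argv: string[]): DateRange {
  const range: DateRange = {};
  for (const arg of argv) {
    if (arg.startsWith('--start=')) {
      range.start = parseDateArg(arg.slice('--start='.length));
    } else if (arg.startsWith('--end=')) {
      range.end = parseDateArg(arg.slice('--end='.length));
    }
  }
  if (range.start && range.end && range.start.getTime() > range.end.getTime()) {
    throw new Error('--start must not be after --end');
  }
  return range;
}
```

In a Node.js entry point this would typically be called as `parseDateRange(process.argv.slice(2))`. When neither flag is passed, both fields stay `undefined`, leaving the choice of a default window to the caller.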

## Common Commands

### Introspect Remote OpenAlex Schema
Set the following environment variables:
```bash
[Your existing env variables]
```
Then run:
```bash
npx drizzle-kit introspect
```

### Generate Batches Migration
```bash
npx drizzle-kit generate --schema=./drizzle/batches.ts --out=./drizzle --dialect=postgresql
```

## Troubleshooting
[Common issues and their solutions]

## Contributing
[How to contribute to the project]

## License
[License information]
40 changes: 40 additions & 0 deletions openalex-importer/docker-compose.yml
@@ -0,0 +1,40 @@
services:
  postgres:
    container_name: 'openalex_db'
    image: 'ankane/pgvector'
    restart: always
    healthcheck:
      test: 'pg_isready -U postgres -d postgres'
      interval: 5s
      timeout: 15s
      retries: 5
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    logging:
      options:
        max-size: 10m
        max-file: '3'
    ports:
      - '5438:5432'
    volumes:
      - ./postgres/postgres-data:/var/lib/postgresql/data
      # copy the sql script to create tables
      - ./postgres/sql/vector_extension.sql:/docker-entrypoint-initdb.d/0-vector_extension.sql
      - ./postgres/sql/create_tables.sql:/docker-entrypoint-initdb.d/create_tables.sql
  openalex-importer:
    container_name: 'openalex-importer'
    command: yarn dev
    build: .
    env_file:
      - .env
    extra_hosts:
      - host.docker.internal:host-gateway
    depends_on:
      postgres:
        condition: service_healthy
    links:
      - postgres
    volumes:
      - .:/app/
      - /app/node_modules