openedx/xapi-db-load

Scripts for generating Aspects xAPI events

Purpose

This package generates a variety of test data used for integration and performance testing of Open edX Aspects. Currently it populates the following datasets:

  • xAPI statements, simulating those generated by event-routing-backends
  • Course and learner data, simulating that generated by event-sink-clickhouse

The generated xAPI events match the current specification of the Open edX event-routing-backends package, but they are not yet maintained alongside it and may fall out of sync over time. Almost all current statements are simulated; statements that are not yet used in Aspects reporting have been skipped.

Features

Once an appropriate database has been created using Aspects, data can be generated in the following ways:

Ralph to ClickHouse

Useful for testing configuration, integration, and permissions, this uses batch POSTs to Ralph for xAPI statements, but still writes directly to ClickHouse for course and actor data. This is the slowest method, but exercises the largest surface area of the project.

Direct to ClickHouse

Useful for getting a medium to large amount of data into the database to test configuration and view reports. xAPI statements are batched; other data is currently inserted one row at a time.

CSV files

Useful for creating datasets that can be reused for checking performance changes with the exact same data, and for extremely large tests. The files can be generated locally or on any service supported by smart_open. They can then optionally be imported to ClickHouse if written locally or to S3. They can also be directly imported from S3 to ClickHouse at any time using the load-db-from-s3 subcommand. This is by far the fastest method for large scale tests.

Getting Started

Usage

Each test run is driven by a configuration file. If no file is given, a small test is run using the default_config.yaml included in the project:

❯ xapi-db-load load-db

To specify a config file:

❯ xapi-db-load load-db --config_file private_configs/my_huge_test.yaml

There is also a subcommand that only loads previously generated CSV data from S3:

❯ xapi-db-load load-db-from-s3 --config_file private_configs/my_s3_test.yaml

Configuration Format

There are a number of different configuration options for tuning the output. In addition to the documentation below, there are example settings files to review in the example_configs directory.

Common Settings

These settings apply to all backends, and determine the size and makeup of the test:

# Location where timing logs will be saved
log_dir: logs

# xAPI statements will be generated in batches, the total number of
# statements is ``num_batches * batch_size``. The batch size is the number
# of statements sent to the backend (Ralph POST, ClickHouse insert, etc.)
num_batches: 3
batch_size: 100

# Overall start and end date for the entire run. All xAPI statements
# will fall within these dates. Different courses will have different start
# and end dates between these days, based on course_length_days below.
start_date: 2014-01-01
end_date: 2023-11-27

# All courses will be this long. They will be fit between start_date and
# end_date, so this must be less than the number of days between them.
course_length_days: 120

# The number of organizations, courses will be evenly spread among these
num_organizations: 3

# The number of learners to create, random subsets of these will be
# "registered" for each course and have statements generated for them
# between their registration date and the end of the course
num_actors: 10

# How many of each size course to create. The sum of these is the total
# number of courses created for the test. The keys are arbitrary, you can
# name them whatever you like and have as many or few sizes as you like.
# The keys must exactly match the definitions in course_size_makeup below.
num_course_sizes:
  small: 1
  medium: 1
  ...

# Course type configurations, how many of each type of object are created
# for each course of this size. "actors" must be less than or equal to
# "num_actors". Keys here must exactly match the keys in num_course_sizes.
course_size_makeup:
  small:
    actors: 5
    problems: 20
    videos: 10
    chapters: 3
    sequences: 10
    verticals: 20
    forum_posts: 20
  medium:
    actors: 7
    problems: 40
    videos: 20
    chapters: 4
    sequences: 20
    verticals: 30
    forum_posts: 40
  ...
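
The constraints described in the comments above can be checked before starting a long run. The following is a minimal sketch, not part of xapi-db-load: `check_config` and `total_statements` are hypothetical helper names, and the config is assumed to be already parsed into a dictionary with `start_date` and `end_date` as `datetime.date` values.

```python
from datetime import date


def total_statements(config: dict) -> int:
    """Total xAPI statements for the run: num_batches * batch_size."""
    return config["num_batches"] * config["batch_size"]


def check_config(config: dict) -> list[str]:
    """Return a list of problems found in the config (empty if none)."""
    problems = []

    # Courses must fit inside the overall date window.
    window_days = (config["end_date"] - config["start_date"]).days
    if config["course_length_days"] >= window_days:
        problems.append("course_length_days must be less than end_date - start_date")

    # Keys of num_course_sizes and course_size_makeup must match exactly.
    mismatched = set(config["num_course_sizes"]) ^ set(config["course_size_makeup"])
    if mismatched:
        problems.append(f"course size keys do not match: {sorted(mismatched)}")

    # A course cannot register more actors than exist overall.
    for size, makeup in config["course_size_makeup"].items():
        if makeup["actors"] > config["num_actors"]:
            problems.append(f"{size}: actors exceeds num_actors")

    return problems


# Values taken from the example settings above (only "actors" is checked
# in the makeup, so the other per-course counts are omitted here).
example = {
    "num_batches": 3,
    "batch_size": 100,
    "start_date": date(2014, 1, 1),
    "end_date": date(2023, 11, 27),
    "course_length_days": 120,
    "num_actors": 10,
    "num_course_sizes": {"small": 1, "medium": 1},
    "course_size_makeup": {"small": {"actors": 5}, "medium": {"actors": 7}},
}

print(total_statements(example))
print(check_config(example))
```

With the example values, the run would generate 3 * 100 statements and the check reports no problems.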

CSV Backend, Local Files

Generates gzipped CSV files to a local directory:

backend: csv_file
csv_output_destination: logs/

CSV Backend, S3 Compatible Destination

Generates gzipped CSV files to remote location:

backend: csv_file
# This can be anything smart-open can handle (ex. a local directory or
# an S3 bucket etc.) but importing to ClickHouse using this tool only
# supports S3 or compatible services like MinIO right now.
# Note that this *must* be an s3:// link, https links will not work
# https://pypi.org/project/smart-open/
csv_output_destination: s3://openedx-aspects-loadtest/logs/large_test/

# These settings are shared with the ClickHouse backend
s3_key:
s3_secret:

CSV Backend, S3 Compatible Destination, Load to ClickHouse

Generates gzipped CSV files to a remote location, then automatically loads them to ClickHouse:

backend: csv_file
# csv_output_destination can be anything smart_open can handle, a local
# directory or an S3 bucket etc., but importing to ClickHouse using this
# tool only supports S3 or compatible services (ex: MinIO) right now
# https://pypi.org/project/smart-open/
csv_output_destination: s3://openedx-aspects-loadtest/logs/large_test/
csv_load_from_s3_after: true

# Note that this *must* be an https link, s3:// links will not work,
# this must point to the same location as csv_output_destination.
s3_source_location: https://openedx-aspects-loadtest.s3.amazonaws.com/logs/large_test/

# This also requires all of the ClickHouse backend variables!
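
Because csv_output_destination and s3_source_location must point to the same place in two different URL forms, it can help to derive one from the other. This is a minimal sketch under an assumption: it uses the AWS virtual-hosted style (`https://<bucket>.s3.amazonaws.com/<key>`) seen in the example above. Other S3-compatible services such as MinIO use different host names, and `s3_uri_to_https` is a hypothetical helper, not part of xapi-db-load.

```python
from urllib.parse import urlparse


def s3_uri_to_https(s3_uri: str) -> str:
    """Convert s3://bucket/key to an AWS virtual-hosted style https URL."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://bucket/key URI, got {s3_uri!r}")
    # netloc is the bucket name, path is the key prefix (with leading slash).
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


print(s3_uri_to_https("s3://openedx-aspects-loadtest/logs/large_test/"))
```

For the bucket in the example config, this produces the same https URL that is shown for s3_source_location.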

ClickHouse Backend

The clickhouse backend setting is only necessary if you are writing directly to ClickHouse. For integrations with Ralph or CSV, use their respective backends instead:

backend: clickhouse

Variables necessary to connect to ClickHouse, whether directly, through Ralph, or as part of loading CSV files:

# ClickHouse connection variables
db_host: localhost
# db_port is also used to determine the "secure" parameter. If the port
# ends in 443 or 440, the "secure" flag will be set on the connection.
db_port: 8443
db_username: ch_admin
db_password: secret

# Schema name for the xAPI schema
db_name: xapi

# Schema name for the event sink schema
db_event_sink_name: event_sink

# These S3 settings are shared with the CSV backend, but passed to
# ClickHouse when loading files from S3
s3_key: <...>
s3_secret: <...>
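
The db_port rule described in the comments above can be sketched as follows. This only illustrates the rule as stated (ports ending in 443 or 440 imply a secure connection); the function name is hypothetical and the tool's actual implementation may differ.

```python
def port_implies_secure(db_port) -> bool:
    """Return True if the port ends in 443 or 440, per the rule above."""
    return str(db_port).endswith(("443", "440"))


for port in (8443, 9440, 8123, 9000):
    print(port, port_implies_secure(port))
```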

Ralph / ClickHouse Backend

Variables necessary to send xAPI statements via Ralph:

backend: ralph_clickhouse
lrs_url: http://ralph.tutor-nightly-local.orb.local/xAPI/statements
lrs_username: ralph
lrs_password: secret

# This also requires all of the ClickHouse backend variables!
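
To show how the lrs_* settings fit together, here is a rough sketch of what one batch POST of xAPI statements might look like. It is an illustration only: it assumes HTTP Basic authentication and the xAPI 1.0.3 version header, the helper name `build_statements_request` and the sample statement are hypothetical, and the request is built but not sent.

```python
import base64
import json
import urllib.request


def build_statements_request(lrs_url, username, password, statements):
    """Build (but do not send) one batch POST of xAPI statements."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        lrs_url,
        data=json.dumps(statements).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
            # Version header required by the xAPI specification.
            "X-Experience-API-Version": "1.0.3",
        },
        method="POST",
    )


# A made-up, minimal statement used only to exercise the helper.
sample_batch = [
    {
        "actor": {"account": {"homePage": "http://localhost", "name": "learner-1"}},
        "verb": {"id": "http://adlnet.gov/expapi/verbs/registered"},
        "object": {"id": "http://localhost/course/course-v1:Org+Course+Run"},
    }
]

request = build_statements_request(
    "http://ralph.tutor-nightly-local.orb.local/xAPI/statements",
    "ralph",
    "secret",
    sample_batch,
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send the batch to the LRS.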

Load from S3 configuration

Variables necessary to run xapi-db-load load-db-from-s3, which skips the event generation process and just loads pre-existing CSV files from S3:

# Note that this must be an https link, s3:// links will not work
s3_source_location: https://openedx-aspects-loadtest.s3.amazonaws.com/logs/large_test/

# This also requires all of the ClickHouse backend variables!

Developing

One Time Setup

# Clone the repository
git clone git@github.com:openedx/xapi-db-load.git
cd xapi-db-load

# Set up a virtualenv using virtualenvwrapper with the same name as the repo
# and activate it
mkvirtualenv -p python3.11 xapi-db-load

Every time you develop something in this repo

# Activate the virtualenv
workon xapi-db-load

# Grab the latest code
git checkout main
git pull

# Install/update the dev requirements
make requirements

# Run the tests and quality checks (to verify the status before you make any
# changes)
make validate

# Make a new branch for your changes
git checkout -b <your_github_username>/<short_description>

# Using your favorite editor, edit the code to make your change.
vim ...

# Run your new tests
pytest ./path/to/new/tests

# Run all the tests and quality checks
make validate

# Commit all your changes
git commit ...
git push

# Open a PR and ask for review.

Getting Help

Documentation

Start by going through the documentation (in progress!).

More Help

If you're having trouble, we have discussion forums at https://discuss.openedx.org where you can connect with others in the community.

Our real-time conversations are on Slack. You can request a Slack invitation, then join our community Slack workspace.

For anything non-trivial, the best path is to open an issue in this repository with as many details about the issue you are facing as you can provide.

https://github.com/openedx/xapi-db-load/issues

For more information about these options, see the Getting Help page.

License

The code in this repository is licensed under the AGPL 3.0 unless otherwise noted.

Please see LICENSE.txt for details.

Contributing

Contributions are very welcome. Please read How To Contribute for details.

This project is currently accepting all types of contributions: bug fixes, security fixes, maintenance work, and new features. However, please discuss any new feature idea with the maintainers before beginning development, to maximize the chances of your change being accepted. You can start a conversation by creating a new issue on this repo summarizing your idea.

The Open edX Code of Conduct

All community members are expected to follow the Open edX Code of Conduct.

People

The assigned maintainers for this component and other project details may be found in Backstage. Backstage pulls this data from the catalog-info.yaml file in this repo.

Reporting Security Issues

Please do not report security issues in public. Please email security@openedx.org.