feat: add s3 xapi backfill #1208

Open
Ian2012 wants to merge 2 commits into cag/vector-default from cag/vector-xapi-backfill

@Ian2012 commented Mar 20, 2026

This PR adds an optional Vector S3 sink for xAPI events, which stores them as compressed JSON. It also adds a command to backfill xAPI events from that same S3 bucket, with options to limit the backfill by date or by path.

Set up the S3 sink:

Set the following options for the S3 sink:

ASPECTS_XAPI_S3_ACCESS_KEY: openedx
ASPECTS_XAPI_S3_BUCKET: xapi-events
ASPECTS_XAPI_S3_ENDPOINT: http://minio:9000
ASPECTS_XAPI_S3_REGION: us-east-1
ASPECTS_XAPI_S3_SECRET_KEY: bo3nnNMWCckDlpJIackIq3Km
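To make "compressed JSON" concrete, here is a minimal sketch of decoding one object from the bucket. It assumes gzip compression (suggested by the `*.log.gz` suffix used in this PR) and one JSON document per line; the PR does not state the framing, so treat that as an assumption. `decode_events` is a hypothetical helper, not part of this PR.

```python
import gzip
import json


def decode_events(blob: bytes) -> list[dict]:
    """Decode one compressed object into a list of event dicts.

    Assumes gzip compression and newline-delimited JSON (not confirmed by the PR).
    """
    text = gzip.decompress(blob).decode("utf-8")
    # One JSON document per non-empty line (assumed framing).
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Round trip with two made-up events standing in for real xAPI statements.
sample = [{"id": "a", "verb": "completed"}, {"id": "b", "verb": "attempted"}]
blob = gzip.compress("\n".join(json.dumps(e) for e in sample).encode("utf-8"))
assert decode_events(blob) == sample
```
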

The Vector S3 sink can be configured with:

ASPECTS_XAPI_S3_SINK_MAX_EVENTS: 10000
# Setting this option too low can create too many small files in S3.
ASPECTS_XAPI_S3_SINK_TIMEOUT_SECS: 600
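As a rough illustration of why a short timeout leads to many small files, the sketch below estimates how many objects the sink would write per day. It assumes a batch is flushed when either the event limit or the timeout is reached, that events arrive evenly through the day, and that empty batches are not written; it ignores any byte-size limit. `estimate_objects_per_day` is a hypothetical helper for reasoning about these two settings, not something this PR provides.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_objects_per_day(events_per_day: int, max_events: int, timeout_secs: int) -> int:
    """Approximate the number of S3 objects written per day (assumptions above)."""
    if events_per_day <= 0:
        return 0
    # Flushes forced by the batch filling up.
    by_size = math.ceil(events_per_day / max_events)
    # Flushes forced by the batch timing out.
    by_timeout = math.ceil(SECONDS_PER_DAY / timeout_secs)
    # Whichever limit triggers more often dominates; a file needs at least one event.
    return min(events_per_day, max(by_size, by_timeout))


# Low-traffic site with the values shown above: the timeout dominates.
print(estimate_objects_per_day(1_000, max_events=10_000, timeout_secs=600))
# Same traffic with a 10-second timeout: close to one file per event.
print(estimate_objects_per_day(1_000, max_events=10_000, timeout_secs=10))
```

Under these assumptions the first call gives 144 objects per day and the second gives 1000, which is the small-file problem the comment warns about.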

Execute the backfill command:

Usage: tutor dev do xapi-backfill [OPTIONS]

  Import xAPI events from S3 into ClickHouse.

  Examples:
      # All events
      tutor local do xapi-backfill
      # March 2026 
      tutor local do xapi-backfill --year 2026 --month 3
      # March 19 2026
      tutor local do xapi-backfill --year 2026 --month 03 --day 19
      # March 19 2026 at 2 pm (UTC)
      tutor local do xapi-backfill --year 2026 --month 03 --day 19 --hour 14
      # Custom path
      tutor local do xapi-backfill --path "xapi/2026/03/19/14/*.log.gz"
      # Backfill all events and deduplicate after insert
      tutor local do xapi-backfill --deduplicate

Options:
  --year TEXT    Year (e.g., '2026', default: '*' for all)
  --month TEXT   Month (e.g., '03' or '3', default: '*' for all)
  --day TEXT     Day (e.g., '19' or '9', default: '*' for all)
  --hour TEXT    Hour in 24h format (e.g., '14' or '3', default: '*' for all)
  --path TEXT    Relative S3 path (e.g., 'xapi/2026/03/19/14/*.log.gz').
                 Exclusive with date options.
  --deduplicate  Run deduplication after the backfill to remove duplicate
                 events.
  -h, --help     Show this message and exit.
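
The date options above can be read as building a glob over the bucket layout. The following is only a sketch of that mapping, assuming the `xapi/<year>/<month>/<day>/<hour>/*.log.gz` layout implied by the `--path` example and zero-padding of single-digit values. `build_s3_glob` and its `prefix` parameter are hypothetical; the actual implementation in tutoraspects/commands_v1.py may differ.

```python
def build_s3_glob(year="*", month="*", day="*", hour="*", path=None, prefix="xapi"):
    """Return a relative S3 glob for the requested time window (illustrative only)."""
    date_parts = (year, month, day, hour)
    if path is not None:
        # --path is documented as exclusive with the date options.
        if any(part != "*" for part in date_parts):
            raise ValueError("--path cannot be combined with date options")
        return path

    def pad(value):
        # Accept both '3' and '03'; leave the wildcard untouched.
        return value if value == "*" else value.zfill(2)

    return "/".join([prefix, year, pad(month), pad(day), pad(hour), "*.log.gz"])


print(build_s3_glob())                                   # all events
print(build_s3_glob(year="2026", month="3"))             # March 2026
print(build_s3_glob(year="2026", month="03", day="19", hour="14"))
```
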

@openedx-webhooks

Thanks for the pull request, @Ian2012!

This repository is currently maintained by @bmtcril.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@openedx-webhooks added the open-source-contribution label (PR author is not from Axim or 2U) Mar 20, 2026
@github-project-automation (bot) moved this to Needs Triage in Contributions Mar 20, 2026
@Ian2012 force-pushed the cag/vector-xapi-backfill branch from 7e1537a to 122cdc0 Mar 20, 2026 16:07
@bmtcril bmtcril left a comment

I'm super excited about this! Just a few comments.

command: -w -c /etc/vector/vector.toml
volumes:
- ../../data/vector:/var/lib/vector
- ../plugins/aspects/apps/vector/local.toml:/etc/vector/vector.toml
Contributor

Does this need to be writable?

Contributor Author

No, it never writes to it, and the watch option is for development only.

Contributor

Ok, it looks like -w is on local here and the :ro got removed from the volume?

Contributor Author

Local should use this too, as a Vector restart didn't change the Vector run options. I had to delete the container with docker rm for the changes to take effect.

Comment thread tutoraspects/commands_v1.py Outdated
Comment thread tutoraspects/commands_v1.py
Comment thread tutoraspects/commands_v1.py Outdated
@Ian2012 force-pushed the cag/vector-xapi-backfill branch from 122cdc0 to be98847 Mar 20, 2026 21:24
@Ian2012 force-pushed the cag/vector-xapi-backfill branch from be98847 to 09a485a Mar 20, 2026 21:45
@mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Mar 23, 2026
@bmtcril bmtcril left a comment

I think this is about ready to go. We just need to make the changes to the container volumes, and I think we need to make clear in the docs that this only works when Vector is the backend.

Labels

open-source-contribution PR author is not from Axim or 2U

Projects

Status: In Eng Review

4 participants