I recently needed to run a scraping script on a schedule to retrieve news articles from daily RSS feeds, and I decided to use GitHub Actions to do so. This post is a quick tutorial on how to do it.
This post covers the following steps:
- Triggering the workflow on a schedule
- Pulling the latest version of the repository
- Running a Python script
- Committing and pushing the changes to the repository
- Summary
Triggering the workflow on a schedule
To trigger the workflow on a schedule, we can use the schedule event (see documentation). The schedule is defined using cron syntax (see documentation), and the times are interpreted in UTC.
on:
  schedule:
    - cron: "0 21 * * *" # run every day at 21:00 UTC
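To check when a daily expression such as "0 21 * * *" will fire next, the arithmetic can be sketched in a few lines of Python. This is only an illustration for the "fixed minute and hour, every day" case used here, not a general cron parser; the function name next_daily_run is made up for this example. Note also that GitHub may start scheduled runs later than the nominal time when its runners are busy.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 21, minute: int = 0) -> datetime:
    """Return the next firing time (in UTC) of a daily 'minute hour * * *' cron.

    `now` must be timezone-aware; it is converted to UTC first because
    GitHub Actions evaluates cron schedules in UTC.
    """
    if now.tzinfo is None:
        raise ValueError("now must be a timezone-aware datetime")
    now_utc = now.astimezone(timezone.utc)
    # Today's slot at hour:minute UTC.
    candidate = now_utc.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed (or is right now), use tomorrow's.
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate


# Example: when would the workflow from this post run next?
print(next_daily_run(datetime.now(timezone.utc)))
```

Converting to UTC before comparing is the important part: a run at 21:00 UTC happens at 22:00 or 23:00 local time in Paris, depending on daylight saving time.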
Pulling the latest version of the repository
To pull the latest version of the repository, we can use the actions/checkout action (see documentation). This action checks out the repository content onto the GitHub runner.
steps:
  - name: checkout repo content
    uses: actions/checkout@v3 # check out the repository content onto the GitHub runner
Running a python script
To run a Python script, we need to set up Python and install the required packages. We can use the actions/setup-python action (see documentation), which sets up Python and pip on the GitHub runner. We can then install the required packages using pip install -r requirements.txt.
steps:
  - name: setup python
    uses: actions/setup-python@v4 # set up Python
    with:
      python-version: '3.9'
  - name: install python packages # install packages from requirements.txt
    run: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
We can also add other steps to install specific packages. For example, I needed to install the French model for spaCy, which can be done using python -m spacy download fr_core_news_lg.
steps:
  - name: install spacy french model
    run: python -m spacy download fr_core_news_lg # install the French model for spaCy (used by my script)
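Since spaCy models are installed as regular Python packages, the script can check at startup that the model is present and fail with a readable message if the download step was skipped. The sketch below is one possible guard using only the standard library; the helper names model_available and require_model are hypothetical and not part of spaCy.

```python
import importlib.util
import sys


def model_available(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def require_model(package_name: str = "fr_core_news_lg") -> None:
    """Exit with a clear message if the spaCy model package is missing."""
    if not model_available(package_name):
        sys.exit(
            f"Missing spaCy model '{package_name}'. "
            f"Install it with: python -m spacy download {package_name}"
        )
```

Calling require_model() at the top of the script makes a missing model show up as a short error in the workflow log instead of a traceback halfway through the run.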
Finally, we can run our Python script, named run.py, using python run.py.
steps:
  - name: execute py script # run run.py
    run: python run.py
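The content of run.py is not shown in this post. As a rough idea of what such a script could look like, here is a minimal sketch that reads an RSS feed and appends unseen articles to a JSON Lines file in the repository. Everything in it is an assumption made for illustration: the function names, the articles.jsonl file and the choice of the standard library XML parser instead of a dedicated feed library.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Union


def fetch(url: str) -> bytes:
    """Download the raw feed (url is supplied by the caller)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def parse_items(rss_xml: Union[str, bytes]) -> list:
    """Extract title and link from every <item> element of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        }
        for item in root.iter("item")
    ]


def append_new_items(items: list, store: Path) -> int:
    """Append items whose link is not yet in the store; return how many were added."""
    seen = set()
    if store.exists():
        for line in store.read_text(encoding="utf-8").splitlines():
            if line.strip():
                seen.add(json.loads(line)["link"])
    added = 0
    with store.open("a", encoding="utf-8") as fh:
        for item in items:
            if item["link"] in seen:
                continue
            seen.add(item["link"])
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")
            added += 1
    return added


def main(feed_url: str, store: Path = Path("articles.jsonl")) -> None:
    """Fetch one feed and record its new articles."""
    added = append_new_items(parse_items(fetch(feed_url)), store)
    print(f"{added} new article(s) written to {store}")
```

Because already known links are skipped, a run that finds nothing new leaves the file untouched, so there is nothing to commit afterwards.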
Commit and push the changes to the repository
To commit and push the changes to the repository, we first create the commit with plain git commands, then push it with the ad-m/github-push-action action (see documentation). The action needs the GITHUB_TOKEN secret (see documentation) in order to be allowed to push to the repository.
To do so, we need to give the GITHUB_TOKEN read and write permissions in the repository settings (Settings > Actions > General > Workflow permissions > Read and write permissions).
GitHub automatically generates this token for each workflow run and exposes it as the GITHUB_TOKEN secret. This token is used to push the changes to the repository. We’ll push the changes directly to the main branch.
steps:
  - name: commit files
    run: |
      git config --local user.email "action@github.com"
      git config --local user.name "GitHub Action"
      git add -A
      git diff-index --quiet HEAD || (git commit -a -m "updated logs" --allow-empty)
  - name: push changes
    uses: ad-m/github-push-action@v0.6.0
    with:
      github_token: ${{ secrets.GITHUB_TOKEN }} # GITHUB_TOKEN is a secret that GitHub Actions generates automatically
      branch: main # push to the main branch
Summary
To summarize, here is the full workflow:
name: cron_scrap
on:
  schedule:
    - cron: "0 21 * * *" # run every day at 21:00 UTC
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repo content
        uses: actions/checkout@v3 # check out the repository content onto the GitHub runner
      - name: setup python
        uses: actions/setup-python@v4 # set up Python
        with:
          python-version: '3.9'
      - name: install python packages # install packages from requirements.txt
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: install french model
        run: python -m spacy download fr_core_news_lg # install the French model for spaCy (used by my script)
      - name: execute py script # run run.py
        run: python run.py
      - name: commit files
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add -A
          git diff-index --quiet HEAD || (git commit -a -m "updated logs" --allow-empty)
      - name: push changes
        uses: ad-m/github-push-action@v0.6.0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }} # GITHUB_TOKEN is a secret that GitHub Actions generates automatically
          branch: main # push to the main branch
I hope you found this quick tutorial on how to run a scraping script on a schedule using GitHub Actions useful!