GitHub Actions Scraping

Automating scraping with GitHub Actions

I recently needed to run a scraping script on a schedule to retrieve news articles from daily RSS feeds (see more). I decided to use GitHub Actions for this, and this post is a quick tutorial on how to set it up.

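We want our workflow to run on a schedule, pull the latest version of the repository, run a Python script, and commit and push the resulting changes back to the repository.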

Triggering the workflow on a schedule

To trigger the workflow on a schedule, we can use the schedule event (see documentation). The schedule is defined with cron syntax (see documentation) and is interpreted in UTC.

on:
  schedule:
    - cron: "0 21 * * *" # run every day at 21:00 UTC
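
To make the cron fields concrete, here is a small Python sketch that works out when a daily expression such as "0 21 * * *" fires next. The five fields are minute, hour, day of month, month, and day of week. The function name next_daily_run is my own, and the sketch deliberately handles only the daily form "M H * * *" rather than full cron syntax.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Return the next UTC run time for a daily cron of the form 'M H * * *'.

    `now` must be timezone-aware. Only daily schedules are supported here.
    """
    # Field order in cron: minute, hour, day of month, month, day of week.
    minute, hour, day, month, weekday = cron.split()
    if (day, month, weekday) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")

    # Today's occurrence of HH:MM in UTC.
    candidate = now.astimezone(timezone.utc).replace(
        hour=int(hour), minute=int(minute), second=0, microsecond=0
    )
    # If that moment has already passed, the next run is tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

Calling next_daily_run("0 21 * * *", datetime.now(timezone.utc)) gives the next 21:00 UTC slot. Treat the result as a lower bound: GitHub can start scheduled runs late when its runners are busy.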

Pulling the latest version of the repository

To pull the latest version of the repository, we can use the actions/checkout action (see documentation). This action checks out the repository content onto the GitHub runner.

steps:
  - name: checkout repo content
    uses: actions/checkout@v3 # check out the repository content onto the GitHub runner

Running a python script

To run a Python script, we first need to set up Python and install the required packages. We can use the actions/setup-python action (see documentation), which sets up Python and pip on the GitHub runner. We can then install the required packages with pip install -r requirements.txt.

steps:
  - name: setup python
    uses: actions/setup-python@v4 # set up Python
    with:
      python-version: '3.9'

  - name: install python packages # install packages from requirements.txt
    run: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt

We can also add other steps to install specific packages. For example, I needed the French model for spaCy, which can be installed with python -m spacy download fr_core_news_lg.

steps:
  - name: install spacy french model
    run: python -m spacy download fr_core_news_lg # install the French model for spaCy (used by my script)

Finally, we can run our Python script, named run.py, with python run.py.

steps:
  - name: execute py script # run run.py
    run: python run.py
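
The post does not show run.py itself. As a rough idea of the kind of script this step could run, here is a minimal sketch that assumes an RSS 2.0 feed and uses only the standard library. The feed URL, the articles.json output file, and the function names parse_rss, merge_articles, and main are all hypothetical. The real script also uses spaCy, which is left out here.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Union

FEED_URL = "https://example.com/rss.xml"  # hypothetical feed URL
OUTPUT = Path("articles.json")            # hypothetical output file


def parse_rss(xml_data: Union[str, bytes]) -> list:
    """Extract title, link and publication date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_data)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


def merge_articles(existing: list, new: list) -> list:
    """Append articles whose link is not stored yet, so reruns do not duplicate entries."""
    seen = {article["link"] for article in existing}
    merged = list(existing)
    for article in new:
        if article["link"] not in seen:
            seen.add(article["link"])
            merged.append(article)
    return merged


def main() -> None:
    """Fetch the feed, merge it into the stored articles, and write the file back."""
    with urllib.request.urlopen(FEED_URL, timeout=30) as response:
        new_articles = parse_rss(response.read())
    existing = json.loads(OUTPUT.read_text(encoding="utf-8")) if OUTPUT.exists() else []
    merged = merge_articles(existing, new_articles)
    OUTPUT.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")

# A real run.py would end by calling main(), typically under
# an `if __name__ == "__main__":` guard.
```

Writing the results to a file inside the repository is what gives the workflow something to commit: each run changes that file only when the feed contains new items.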

Commit and push the changes to the repository

To push the changes back to the repository, we can use the ad-m/github-push-action action (see documentation). This action only pushes, so we first create the commit ourselves with plain git commands in a separate step. The action authenticates with the GITHUB_TOKEN secret (see documentation), which needs write access to the repository. To grant it, set the workflow permissions of GITHUB_TOKEN to read and write in the repository settings:

[Screenshot: GITHUB_TOKEN workflow permissions in the repository settings]

GitHub then automatically generates a token for each workflow run and exposes it as the GITHUB_TOKEN secret. The push action uses this token to push the changes to the repository. We’ll push the changes directly to the main branch.

steps:
  - name: commit files
    run: |
      git config --local user.email "action@github.com"
      git config --local user.name "GitHub Action"
      git add -A
      # commit only when something differs from HEAD
      git diff-index --quiet HEAD || (git commit -a -m "updated logs" --allow-empty)

  - name: push changes
    uses: ad-m/github-push-action@v0.6.0
    with:
      github_token: ${{ secrets.GITHUB_TOKEN }} # token generated automatically by GitHub Actions
      branch: main # push to the main branch

Summary

To summarize, here is the full workflow:

name: cron_scrap

on:
  schedule:
    - cron: "0 21 * * *" # run every day at 21:00 UTC

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: checkout repo content
        uses: actions/checkout@v3 # check out the repository content onto the GitHub runner
        
      - name: setup python
        uses: actions/setup-python@v4 # setup python
        with:
          python-version: '3.9'
          
      - name: install python packages # install packages from requirements.txt
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt 
                    
      - name: install french model
        run: python -m spacy download fr_core_news_lg # install the French model for spaCy (used by my script)

      - name: execute py script # run run.py
        run: python run.py
          
      - name: commit files
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add -A
          git diff-index --quiet HEAD || (git commit -a -m "updated logs" --allow-empty)
                    
      - name: push changes
        uses: ad-m/github-push-action@v0.6.0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }} # GITHUB_TOKEN is a secret variable that is automatically generated by GitHub Actions
          branch: main # push to the main branch

I hope you found this quick tutorial on how to run a scraping script on a schedule using GitHub Actions useful!