How to Sync Simple Analytics Data to BigQuery using GitHub Actions

How to Sync Simple Analytics Data to BigQuery using GitHub Actions

Herman Schaaf
Name
Herman Schaaf
Twitter
https://github.com/hermanschaaf

Introduction

In this how-to guide, we will show you how you can load page views and events from Simple Analytics into BigQuery on regular schedule. We will use GitHub Actions to automate the process. This is a simple, reliable (and potentially free!) way to load Simple Analytics data into BigQuery on a regular basis.

Prerequisites

Before we start, to follow this guide you will need the following:

  • a Simple Analytics account to load data from,
  • a BigQuery dataset to sync to,
  • a GitHub repository to store the workflow and config files, and
  • the gcloud CLI (opens in a new tab) installed and configured. You may also choose to use the Google Cloud Console UI (or Terraform) instead, but we won't cover the steps for that here.

Now, let's get started!

Step 1: Add Workflow Configuration to the GitHub repository

Step 1.1: Add the workflow file

Inside your git repository, create a new file called .github/workflows/sync_simple_analytics_to_bigquery.yml and add the following content:

name: Sync Simple Analytics to BigQuery

on:
  workflow_dispatch:
  schedule:
    - cron: "0 2 * * *"  # daily at 02:00 UTC

jobs:
  simple-analytics-to-bigquery:
    timeout-minutes: 60
    runs-on: ubuntu-latest
    permissions:
      id-token: 'write' # This required for OIDC
      contents: 'read' # This is required for actions/checkout@v3
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 2
      - name: Authenticate to Google Cloud
        uses: 'google-github-actions/auth@v0'
        with:
          workload_identity_provider: 'projects/${{ secrets.SA_BIGQUERY_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/cloudquery-pool/providers/cloudquery-provider'
          service_account: 'cloudquery-service-account@${{ secrets.SA_BIGQUERY_PROJECT_ID }}.iam.gserviceaccount.com'
      - uses: cloudquery/setup-cloudquery@v3
        name: Setup CloudQuery
        with:
          version: "v2.3.1"
      - name: CloudQuery Sync
        run: cloudquery sync --log-console simple-analytics.yml bigquery.yml
        env:
          SA_USER_ID: ${{ secrets.SA_USER_ID }}
          SA_API_KEY: ${{ secrets.SA_API_KEY }}
          SA_BIGQUERY_PROJECT_ID: ${{ secrets.SA_BIGQUERY_PROJECT_ID }}
          SA_BIGQUERY_DATASET_ID: ${{ secrets.SA_BIGQUERY_DATASET_ID }}

This configuration allows us to run the workflow manually or on a schedule. The example above is configured to run daily at 2 a.m. UTC. The workflow will run on a self-hosted runner, which is a virtual machine hosted by GitHub. The workflow will run the following steps:

  1. Check out the repository
  2. Authenticate to Google Cloud using the google-github-actions/auth (opens in a new tab) Action
  3. Setup CloudQuery using the cloudquery/setup-cloudquery (opens in a new tab) Action
  4. Run the cloudquery sync command to load data from Simple Analytics into BigQuery

We will now configure the secrets and the configuration files.

Step 1.2: Add the secrets

In the GitHub repository, go to Settings > Secrets and add the following repository secrets:

  • SA_USER_ID: the user ID of your Simple Analytics account. You can find this in the Simple Analytics Account Settings (opens in a new tab).
  • SA_API_KEY: the API key of your Simple Analytics account. You can find this in the Simple Analytics Account Settings (opens in a new tab).
  • SA_BIGQUERY_PROJECT_ID: the ID of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.
  • SA_BIGQUERY_DATASET_ID: the ID of the BigQuery dataset where you want to load the data. You can find this in the BigQuery UI.
  • SA_BIGQUERY_PROJECT_NUMBER: the number of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.

Step 1.3: Add the Simple Analytics source plugin configuration file

In the git repository, create a new file called simple-analytics.yml and add the following content:

kind: source
spec:
  name: "simple-analytics"
  path: "simpleanalytics/simple-analytics"
  version: "v1.0.0"
  tables:
    ["*"]
  destinations:
    - "bigquery"
  spec:
    user_id: ${SA_USER_ID}
    api_key: ${SA_API_KEY}
    websites:
      - hostname: <your-website-hostname>

You may find the latest version on the releases page. Replace <your-website-hostname> with the hostname of your website as defined on Simple Analytics, e.g. mywebsite.com. You can also add more than one website, specify additional metadata that should be fetched or a period of days to fetch data for. See the Simple Analytics source plugin documentation for all the details.

Step 1.4: Add the BigQuery destination plugin configuration file

Now, let's set up the BigQuery destination. In the git repository, create a new file called bigquery.yml and add the following content:

kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  version: "v2.1.2"
  write_mode: "append"
  spec:
    project_id: ${SA_BIGQUERY_PROJECT_ID}
    dataset_id: ${SA_BIGQUERY_DATASET_ID}

You can find the latest version in the BigQuery destination plugin documentation (opens in a new tab). There are also a few more configuration options you may want to consider, like partitioning based on sync time.

Step 1.5: Push to the repository

With all this in place, make sure you to push the files to the GitHub repository. You can now run the workflow manually by going to Actions > Sync Simple Analytics to BigQuery > Run workflow and clicking the Run workflow button. It will run, but probably end with an error, because we still need to set up the Workload Identity to authenticate with BigQuery. Let's do that next.

Step 2: Set up Workload Identity

In this section, we will set up Workload Identity to allow the GitHub Actions runner to authenticate with BigQuery. This is required because the GitHub Actions runner is a self-hosted runner, which is not part of the Google Cloud project. We will use the google-github-actions/auth (opens in a new tab) and the steps in this guide are based on the instructions there, but fine-tuned for our use case.

Step 2.1: Set Environment Variables

First, let's set a few environment variables that we'll use throughout the guide. Set the following variables, replacing <your-project-id> and <your-repo> with your own values:

export PROJECT_ID=<your-project-id> # The ID of your Google Cloud project, e.g. "my-project"
export REPO=<your-repo> # e.g. cloudquery/simple-analytics-to-bigquery

Step 2.2: Create a Service Account

We will need a service account that can write to BigQuery. We will use the gcloud CLI to create the service account and grant it the necessary permissions. Run the following commands:

gcloud iam service-accounts create "cloudquery-service-account" \
  --project "${PROJECT_ID}"

If you haven't used the IAM Credentials API in the project before, it will need to be enabled:

gcloud services enable iamcredentials.googleapis.com \
  --project "${PROJECT_ID}"

Now we will grant the Google Cloud Service Account permissions to access Google Cloud resources. In this case, BigQuery:

We will assign the service account the BigQuery Data Editor role:

gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" --role=roles/bigquery.dataEditor

Step 2.3: Create a Workload Identity Pool and Provider

We will now set up Workload Identity, which the GitHub action will rely on to authenticate itself without the need for keys. You can read more about keyless authentication with GCP and GitHub Actions here (opens in a new tab).

First we need to create a pool:

gcloud iam workload-identity-pools create "cloudquery-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --display-name="Cloudquery pool"

Now let's describe the pool and set an environment variable that we'll use in subsequent steps:

export WORKLOAD_IDENTITY_POOL_ID=$(gcloud iam workload-identity-pools describe "cloudquery-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --format="value(name)")

We can use echo $WORKLOAD_IDENTITY_POOL_ID to verify that the variable is set correctly:

echo $WORKLOAD_IDENTITY_POOL_ID
# projects/<your-project-id>/locations/global/workloadIdentityPools/cloudquery-pool

Now let's create an OIDC provider:

gcloud iam workload-identity-pools providers create-oidc "cloudquery-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="cloudquery-pool" \
  --display-name="Cloudquery provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
  --issuer-uri="https://token.actions.githubusercontent.com"

Finally, we need to grant the service account the Workload Identity User role:

gcloud iam service-accounts add-iam-policy-binding "cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
  --project="${PROJECT_ID}" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/${WORKLOAD_IDENTITY_POOL_ID}/attribute.repository/${REPO}"

Step 3: Query the data!

Now that we have set up the Workload Identity, we can run the workflow again. Go to Actions > Sync Simple Analytics to BigQuery > Run workflow and click the Run workflow button. You should see the workflow run successfully. You can also check the logs to see that the data was successfully written to BigQuery, or get messages about any errors that occurred. After a successful workflow run you should be able to query the data in BigQuery!

Let's head over to the BigQuery console and run the following query to test it out:

SELECT * FROM `cq-analytics-v1.simple_analytics.simple_analytics_page_views` LIMIT 10

BigQuery query results

Nice, we have raw Simple Analytics data in BigQuery! We can now use this data to build dashboards, reports, and more. The GitHub Action will run daily, so the data will be kept up to date automatically.

Step 4: Post-processing the data (Optional)

Now that we have extracted and loaded our data into BigQuery, we may also want to do the "T" part of ELT: transformation as a post-processing step. For example, because the BigQuery destination only supports append mode, we should make sure that our query results don't contain duplicates. One simple way to do this is to run a SQL to remove duplicates after the sync completes. We can do this by adding a few additional steps to the workflow that run deduplication queries as SQL.

Let's first write the queries. First we'll create deduplicate_page_views.sql:

CREATE
OR REPLACE TABLE `simple_analytics`.`simple_analytics_page_views` AS
SELECT
    *
FROM
    `simple_analytics`.`simple_analytics_page_views`
WHERE
    _cq_sync_time = (
    SELECT
    max(_cq_sync_time)
    FROM
    `simple_analytics`.`simple_analytics_page_views`
    );

And similarly, we'll create deduplicate_events.sql:

CREATE
OR REPLACE TABLE `simple_analytics`.`simple_analytics_events` AS
SELECT
    *
FROM
    `simple_analytics`.`simple_analytics_events`
WHERE
    _cq_sync_time = (
    SELECT
    max(_cq_sync_time)
    FROM
    `simple_analytics`.`simple_analytics_events`
    );

Now we can add these queries to the workflow. We'll add a few steps to the workflow to install the bq tool and run the queries:

  - name: Set up Cloud SDK
    uses: 'google-github-actions/setup-gcloud@v1'
  - name: Deduplicate page views
    run: bq query --use_legacy_sql=False --flagfile deduplicate_page_views.sql
  - name: Deduplicate events
    run: bq query --use_legacy_sql=False --flagfile deduplicate_events.sql

Conclusion

That's it! Now we have a fully automated pipeline that extracts data from Simple Analytics, loads it into BigQuery, and deduplicates it on a regular basis. We can now use this data to build dashboards, reports, and more. To see a live repository employing this workflow, check out the cloudquery/recipes repository (opens in a new tab).