I already wrote about the docs feature in dbt and how it helps create consistent and accurate documentation throughout the whole dbt project (see this). In short, you can store the description of the most common/important columns used in the data models of your project by adding them to .md files, which live in the docs folder of your dbt project.
A very simple example of an orders.md file that contains the description for the most common order-related column names:
# Fields description

## order_id
{% docs orders__order_id %}
Unique alphanumeric identifier of the order, used to join all order dimension tables
{% enddocs %}

## order_country
{% docs orders__order_country %}
Country where the order was placed. Format is country ISO 3166 code.
{% enddocs %}

## order_value
{% docs orders__order_value %}
Total value of the order in local currency.
{% enddocs %}

## order_date
{% docs orders__order_date %}
Date of the order in local timezone
{% enddocs %}
And its usage in the .yml file of a model:
columns:
  - name: order_id
    description: '{{ doc("orders__order_id") }}'
When the dbt docs are generated, the description of order_id will always be the same, as long as the doc function is used in the .yml file of the model. The benefit of having this centralized documentation is clear and straightforward.
The problem
However, especially with large projects and frequent changes (new models, or changes to existing ones), it’s likely that the repository’s contributors will either forget to use the doc function or be unaware that a specific column has been added to the docs folder. This has two consequences:
- someone has to catch this during PR review and request a change, assuming there’s at least one reviewer who either knows all the documented columns by heart or always checks the docs folder manually
- since the omission easily goes unnoticed and catching it relies on people, this setup defeats the purpose of having centralized documentation.
The solution
The simple answer to this problem is a CI (continuous integration) check that combines a GitHub workflow with a Python script. This check fails if:
- the changes in the PR affect a .yml file that contains a column name present in the docs, but the doc function is not used for that column
- the changes in the PR affect a .yml file that contains a column name present in the docs, but that column has no description at all
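The two failure conditions can be sketched as a single rule per column. This is only a minimal illustration, not the script used in the CI: the helper name `should_fail` is hypothetical, it assumes doc blocks are named `<prefix>__<column_name>`, and it detects the doc function with a plain substring test rather than a full pattern.

```python
from typing import Optional, Set


def should_fail(column_name: str, description: Optional[str], doc_names: Set[str]) -> bool:
    """Return True when a centrally documented column does not use the doc function."""
    # Assumed naming convention: a doc block for a column ends with "__<column_name>".
    documented = any(dn.endswith(f"__{column_name}") for dn in doc_names)
    # A missing or empty description can never contain a doc() call.
    uses_doc = bool(description) and "doc(" in description
    return documented and not uses_doc


doc_names = {"orders__order_id", "orders__order_country", "orders__order_date"}

# Free-text description on a documented column: first failure condition.
print(should_fail("order_country", "The country where the order was placed.", doc_names))
# No description at all on a documented column: second failure condition.
print(should_fail("order_date", None, doc_names))
# Documented column that already uses the doc function.
print(should_fail("order_id", '{{ doc("orders__order_id") }}', doc_names))
# Column that is not in the docs: nothing to enforce.
print(should_fail("order_address", "The address where the order was placed.", doc_names))
```

Both failure conditions collapse into the same branch: a column that is present in the docs fails whether its description is free text or missing entirely.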
Let’s have a closer look at the code and files needed to run this check, and at a couple of examples. As previously mentioned, there are two things to consider: (1) a .yml file for the workflow and (2) a Python file for the actual validation check.
(1) This is what the validation_docs file looks like. It’s located in the .github/workflows folder.
name: Validate Documentation

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  validate_docs:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Install PyYAML
        run: pip install pyyaml
      - name: Run validation script
        run: python validate_docs.py
The workflow runs whenever a pull request is opened or reopened, and every time a new commit is pushed to the remote branch. It then has three steps: retrieving the repository’s files for the current pull request (with the full history, so that the diff against origin/main can be computed), installing the dependencies, and running the validation script.
(2) Then the validate_docs.py script, located in the root folder of your dbt project repository, looks like this:
import os
import sys
import yaml
import re
from glob import glob
from pathlib import Path
import subprocess


def get_changed_files():
    # Files changed on the PR branch since it diverged from origin/main
    diff_command = ['git', 'diff', '--name-only', 'origin/main...']
    result = subprocess.run(diff_command, capture_output=True, text=True)
    changed_files = result.stdout.strip().split('\n')
    return changed_files


def extract_doc_names():
    # Collect the name of every {% docs <name> %} block under docs/
    doc_names = set()
    md_files = glob('docs/**/*.md', recursive=True)
    doc_pattern = re.compile(r'{%\s*docs\s+([^\s%]+)\s*%}')
    for md_file in md_files:
        with open(md_file, 'r') as f:
            content = f.read()
            matches = doc_pattern.findall(content)
            doc_names.update(matches)
    print(f"Extracted doc names: {doc_names}")
    return doc_names


def parse_yaml_file(yaml_path):
    # A file may hold several YAML documents, hence safe_load_all
    with open(yaml_path, 'r') as f:
        try:
            return list(yaml.safe_load_all(f))
        except yaml.YAMLError as exc:
            print(f"Error parsing YAML file {yaml_path}: {exc}")
            return []


def validate_columns(columns, doc_names, errors, model_name):
    for column in columns:
        column_name = column.get('name')
        # "or ''" also covers a description key that is present but empty
        description = column.get('description') or ''
        print(f"Validating column '{column_name}' in model '{model_name}'")
        print(f"Description: '{description}'")
        doc_usage = re.findall(r'{{\s*doc\(["\']([^"\']+)["\']\)\s*}}', description)
        print(f"Doc usage found: {doc_usage}")
        if doc_usage:
            # Every referenced doc block must exist in the docs folder
            for doc_name in doc_usage:
                if doc_name not in doc_names:
                    errors.append(
                        f"Column '{column_name}' in model '{model_name}' references undefined doc '{doc_name}'."
                    )
        else:
            # No doc() call: check whether a doc block exists for this column name
            matching_docs = [dn for dn in doc_names if dn.endswith(f"__{column_name}")]
            if matching_docs:
                suggested_doc = matching_docs[0]
                errors.append(
                    f"Column '{column_name}' in model '{model_name}' should use '{{{{ doc(\"{suggested_doc}\") }}}}' in its description."
                )
            else:
                print(f"No matching doc found for column '{column_name}'")


def main():
    changed_files = get_changed_files()
    yaml_files = [f for f in changed_files if f.endswith('.yml') or f.endswith('.yaml')]
    doc_names = extract_doc_names()
    errors = []
    for yaml_file in yaml_files:
        # Skip files that were deleted in the PR
        if not os.path.exists(yaml_file):
            continue
        yaml_content = parse_yaml_file(yaml_file)
        for item in yaml_content:
            if not isinstance(item, dict):
                continue
            models = item.get('models') or item.get('sources')
            if not models:
                continue
            for model in models:
                model_name = model.get('name')
                columns = model.get('columns', [])
                validate_columns(columns, doc_names, errors, model_name)
    if errors:
        print("Documentation validation failed with the following errors:")
        for error in errors:
            print(f"- {error}")
        sys.exit(1)
    else:
        print("All documentation validations passed.")


if __name__ == "__main__":
    main()
Let’s summarise the steps in the script:
- it lists all files that have been changed in the pull request compared to the main branch on origin (origin/main).
- it looks through all markdown (.md) files inside the docs folder (including subdirectories) and searches for documentation block patterns using a regex. Each time it finds such a pattern, it extracts the doc name and adds it to a set of doc names.
- for each changed .yml file, the script opens and parses it using yaml.safe_load_all. This converts the .yml content into Python dictionaries (or lists) for easy analysis.
- validate_columns: for each column defined in the .yml files, it checks the description field to see if it includes a {{ doc() }} reference. If references are found, it verifies that the referenced doc name actually exists in the set of doc names extracted earlier. If not, it reports an error. If no doc references are found, it checks whether there’s a doc block that matches this column’s name. Note that here we’re using a naming convention like doc_block__column_name. If such a block exists, it suggests that the column description should reference this doc.
Any problems (missing doc references, non-existent referenced docs) are recorded as errors.
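To make the two regular expressions from the steps above more concrete, here is a small standalone sketch that applies them to sample strings. The sample markdown and descriptions are made up for illustration; the constant names `DOC_BLOCK` and `DOC_CALL` are not part of the script.

```python
import re

# Matches the opening tag of a docs block and captures its name.
DOC_BLOCK = re.compile(r'{%\s*docs\s+([^\s%]+)\s*%}')
# Matches a {{ doc("...") }} call (single or double quotes) and captures the name.
DOC_CALL = re.compile(r'{{\s*doc\(["\']([^"\']+)["\']\)\s*}}')

sample_md = """
## order_id
{% docs orders__order_id %}
Unique alphanumeric identifier of the order.
{% enddocs %}
"""

# The closing {% enddocs %} tag is not captured: "enddocs" does not start with "docs".
block_names = DOC_BLOCK.findall(sample_md)
call_names = DOC_CALL.findall('{{ doc("orders__order_id") }}')
free_text = DOC_CALL.findall("The value of the order.")

print(block_names)
print(call_names)
print(free_text)
```

A description written as free text yields an empty list, which is exactly the case where the script falls back to looking for a doc block whose name ends with `__<column_name>`.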
Examples
Now, let’s take a look at the CI in action. Given the orders.md file shared at the beginning of the article, we now push to remote a commit that contains the ref_country_orders.yml file:
version: 2

models:
  - name: ref_country_orders
    description: >
      This model filters orders from the staging orders table to include only those with an order date on or after January 1, 2020.
      It includes information such as the order ID, order country, order value, and order date.
    columns:
      - name: order_id
        description: '{{ doc("orders__order_id") }}'
      - name: order_country
        description: The country where the order was placed.
      - name: order_value
        description: The value of the order.
      - name: order_address
        description: The address where the order was placed.
      - name: order_date
The CI has failed. Clicking on Details takes us to the log of the CI, where we see this:
Validating column 'order_id' in model 'ref_country_orders'
Description: '{{ doc("orders__order_id") }}'
Doc usage found: ['orders__order_id']
Validating column 'order_country' in model 'ref_country_orders'
Description: 'The country where the order was placed.'
Doc usage found: []
Validating column 'order_value' in model 'ref_country_orders'
Description: 'The value of the order.'
Doc usage found: []
Validating column 'order_address' in model 'ref_country_orders'
Description: 'The address where the order was placed.'
Doc usage found: []
No matching doc found for column 'order_address'
Validating column 'order_date' in model 'ref_country_orders'
Description: ''
Doc usage found: []
Let’s analyze the log:
- for the order_id column, it found the doc usage in its description.
- the order_address column isn’t found in the docs file, so it returns a No matching doc found for column 'order_address'
- for order_value and order_country, it knows that they’re listed in the docs but the doc usage is empty. The same goes for order_date, and note that for this one we didn’t even add a description line
All good so far. But let’s keep looking at the log:
Documentation validation failed with the following errors:
- Column 'order_country' in model 'ref_country_orders' should use '{{ doc("orders__order_country") }}' in its description.
- Column 'order_value' in model 'ref_country_orders' should use '{{ doc("orders__order_value") }}' in its description.
- Column 'order_date' in model 'ref_country_orders' should use '{{ doc("orders__order_date") }}' in its description.
Error: Process completed with exit code 1.
Since order_country, order_value and order_date are in the docs file but the doc function isn’t used, the CI raises an error. It also suggests the exact value to add in the description, which makes it extremely easy for the PR author to copy-paste the correct description value from the CI log into the .yml file.
After pushing the new changes, the CI check was successful and the log now looks like this:
Validating column 'order_id' in model 'ref_country_orders'
Description: '{{ doc("orders__order_id") }}'
Doc usage found: ['orders__order_id']
Validating column 'order_country' in model 'ref_country_orders'
Description: '{{ doc("orders__order_country") }}'
Doc usage found: ['orders__order_country']
Validating column 'order_value' in model 'ref_country_orders'
Description: '{{ doc("orders__order_value") }}'
Doc usage found: ['orders__order_value']
Validating column 'order_address' in model 'ref_country_orders'
Description: 'The address where the order was placed.'
Doc usage found: []
No matching doc found for column 'order_address'
Validating column 'order_date' in model 'ref_country_orders'
Description: '{{ doc("orders__order_date") }}'
Doc usage found: ['orders__order_date']
All documentation validations passed.
For the order_address column, the log shows that no matching doc was found. However, that’s fine and doesn’t cause the CI to fail, since adding that column to the docs file is not our intention for this demonstration. Meanwhile, the rest of the columns are listed in the docs and are correctly using the {{ doc() }} function.
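One caveat about the suggestion logic: the doc names are kept in a set, so if two doc blocks end with the same `__<column_name>` suffix, which one gets suggested is not guaranteed to be the same from run to run. The sketch below is a possible variant, not part of the script above, that returns every candidate in sorted order; the helper name `candidate_docs` and the `payments__order_id` block are hypothetical.

```python
from typing import Iterable, List


def candidate_docs(column_name: str, doc_names: Iterable[str]) -> List[str]:
    """Return every doc block matching the column, in a stable (sorted) order."""
    suffix = f"__{column_name}"
    return sorted(dn for dn in doc_names if dn.endswith(suffix))


# "payments__order_id" is a made-up second block sharing the order_id suffix.
doc_names = {"payments__order_id", "orders__order_id", "orders__order_date"}

print(candidate_docs("order_id", doc_names))
print(candidate_docs("order_address", doc_names))
```

Listing all candidates in the error message would leave the choice to the PR author instead of picking one arbitrarily.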