No More Tableau Downtime: Metadata API for Proactive Data Health

In right this moment’s world, the reliability of information options is all the things. Once we construct dashboards and stories, one expects that the numbers mirrored there are appropriate and up-to-date. Based mostly on these numbers, insights are drawn and actions are taken. For any unexpected cause, if the dashboards are damaged or if the numbers are incorrect — then it turns into a fire-fight to repair all the things. If the problems usually are not fastened in time, then it damages the belief positioned on the information group and their options.

However why would dashboards be damaged or have incorrect numbers? If the dashboard was constructed appropriately the primary time, then 99% of the time the problem comes from the information that feeds the dashboards — from the information warehouse. Some attainable eventualities are:

Few ETL pipelines failed, so the brand new information shouldn’t be but in
A desk is changed with one other new one
Some columns within the desk are dropped or renamed
Schemas in information warehouse have modified
And plenty of extra.

There’s nonetheless an opportunity that the problem is on the Tableau website, however in my expertise, many of the occasions, it’s at all times on account of some adjustments in information warehouse. Regardless that we all know the basis trigger, it’s not at all times easy to start out engaged on a repair. There’s no central place the place you may test which Tableau information sources depend on particular tables. When you have the Tableau Data Management add-on, it might assist, however from what I do know, its arduous to seek out dependencies of customized sql queries utilized in information sources.

Nonetheless, the add-on is simply too costly and most firms don’t have it. The true ache begins when it’s important to undergo all the information sources manually to start out fixing it. On high of it, you’ve got a string of customers in your head impatiently ready for a quick-fix. The repair itself may not be tough, it might simply be a time-consuming one.

What if we might anticipate these points and establish impacted information sources earlier than anybody notices an issue? Wouldn’t that simply be nice? Nicely, there’s a approach now with the Tableau Metadata API. The Metadata API makes use of GraphQL, a question language for APIs that returns solely the information that you simply’re desirous about. For more information on what’s attainable with GraphQL, do take a look at GraphQL.org.

On this weblog publish, I’ll present you the way to connect with the Tableau Metadata API utilizing Python’s Tableau Server Consumer (TSC) library to proactively establish information sources utilizing particular tables, in an effort to act quick earlier than any points come up. As soon as you understand which Tableau information sources are affected by a selected desk, you can also make some updates your self or alert the homeowners of these information sources in regards to the upcoming adjustments to allow them to be ready for it.

Connecting to the Tableau Metadata API

Lets hook up with the Tableau Server utilizing TSC. We have to import in all of the libraries we would want for the train!

### Import all required libraries
import tableauserverclient as t
import pandas as pd
import json
import ast
import re

So as to hook up with the Metadata API, you’ll have to first create a private entry token in your Tableau Account settings. Then replace the & with the token you simply created. Additionally replace together with your Tableau website. If the connection is established efficiently, then “Linked” might be printed within the output window.

### Connect with Tableau server utilizing private entry token
tableau_auth = t.PersonalAccessTokenAuth("", "", 
                                           site_id="")
server = t.Server("https://dub01.on-line.tableau.com/", use_server_version=True)

with server.auth.sign_in(tableau_auth):
        print("Linked")

Lets now get a listing of all information sources which might be printed in your website. There are numerous attributes you may fetch, however for the present use case, lets preserve it easy and solely get the id, title and proprietor contact info for each information supply. This might be our grasp listing to which we are going to add in all different info.

############### Get all of the listing of information sources in your Web site

all_datasources_query = """ {
  publishedDatasources {
    title
    id
    proprietor {
    title
    e-mail
    }
  }
}"""
with server.auth.sign_in(tableau_auth):
    end result = server.metadata.question(
        all_datasources_query
    )

Since I need this weblog to be focussed on learn how to proactively establish which information sources are affected by a selected desk, I’ll not be going into the nuances of Metadata API. To higher perceive how the question works, you may consult with a really detailed Tableau’s personal Metadata API documentation.

One factor to notice is that the Metadata API returns information in a JSON format. Relying on what you’re querying, you’ll find yourself with a number of nested json lists and it may possibly get very tough to transform this right into a pandas dataframe. For the above metadata question, you’ll find yourself with a end result which would really like beneath (that is mock information simply to present you an concept of what the output seems like):

{
  "information": {
    "publishedDatasources": [
      {
        "name": "Sales Performance DataSource",
        "id": "f3b1a2c4-1234-5678-9abc-1234567890ab",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "title": "Buyer Orders DataSource",
        "id": "a4d2b3c5-2345-6789-abcd-2345678901bc",
        "proprietor": {
          "title": "Bob Smith",
          "e-mail": "[email protected]"
        }
      },
      {
        "title": "Product Returns and Profitability",
        "id": "c5e3d4f6-3456-789a-bcde-3456789012cd",
        "proprietor": {
          "title": "Alice Johnson",
          "e-mail": "[email protected]"
        }
      },
      {
        "title": "Buyer Segmentation Evaluation",
        "id": "d6f4e5a7-4567-89ab-cdef-4567890123de",
        "proprietor": {
          "title": "Charlie Lee",
          "e-mail": "[email protected]"
        }
      },
      {
        "title": "Regional Gross sales Traits (Customized SQL)",
        "id": "e7a5f6b8-5678-9abc-def0-5678901234ef",
        "proprietor": {
          "title": "Bob Smith",
          "e-mail": "[email protected]"
        }
      }
    ]
  }
}

We have to convert this JSON response right into a dataframe in order that its straightforward to work with. Discover that we have to extract the title and e-mail of the proprietor from contained in the proprietor object.

### We have to convert the response into dataframe for simple information manipulation

col_names = end result['data']['publishedDatasources'][0].keys()
master_df = pd.DataFrame(columns=col_names)

for i in end result['data']['publishedDatasources']:
    tmp_dt = {okay:v for okay,v in i.gadgets()}
    master_df = pd.concat([master_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the proprietor title and e-mail from the proprietor object
master_df['owner_name'] = master_df['owner'].apply(lambda x: x.get('title') if isinstance(x, dict) else None)
master_df['owner_email'] = master_df['owner'].apply(lambda x: x.get('e-mail') if isinstance(x, dict) else None)

master_df.reset_index(inplace=True)
master_df.drop(['index','owner'], axis=1, inplace=True)
print('There are ', master_df.form[0] , ' datasources in your website')

That is how the construction of master_df would seem like:

Pattern output of code

As soon as we have now the principle listing prepared, we are able to go forward and begin getting the names of the tables embedded within the information sources. In case you are an avid Tableau person, you understand that there are two methods to deciding on tables in a Tableau information supply — one is to straight select the tables and set up a relation between them and the opposite is to make use of a customized sql question with a number of tables to attain a brand new resultant desk. Due to this fact, we have to tackle each the circumstances.

Processing of Customized SQL question tables

Under is the question to get the listing of all customized SQLs used within the website together with their information sources. Discover that I’ve filtered the listing to get solely first 500 customized sql queries. In case there are extra in your org, you’ll have to use an offset to get the subsequent set of customized sql queries. There’s additionally an choice of utilizing cursor methodology in Pagination while you need to fetch giant listing of outcomes (refer here). For the sake of simplicity, I simply use the offset methodology as I do know, as there are lower than 500 customized sql queries used on the positioning.

# Get the information sources and the desk names from all of the customized sql queries used in your Web site

custom_table_query = """  {
  customSQLTablesConnection(first: 500){
    nodes {
        id
        title
        downstreamDatasources {
        title
        }
        question
    }
  }
}
"""

with server.auth.sign_in(tableau_auth):
    custom_table_query_result = server.metadata.question(
        custom_table_query
    )

Based mostly on our mock information, that is how our output would seem like:

{
  "information": {
    "customSQLTablesConnection": {
      "nodes": [
        {
          "id": "csql-1234",
          "name": "RegionalSales_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Regional Sales Trends (Custom SQL)"
            }
          ],
          "question": "SELECT r.region_name, SUM(s.sales_amount) AS total_sales FROM ecommerce.sales_data.Gross sales s JOIN ecommerce.sales_data.Areas r ON s.region_id = r.region_id GROUP BY r.region_name"
        },
        {
          "id": "csql-5678",
          "title": "ProfitabilityAnalysis_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Product Returns and Profitability"
            }
          ],
          "question": "SELECT p.product_category, SUM(s.revenue) AS total_profit FROM ecommerce.sales_data.Gross sales s JOIN ecommerce.sales_data.Merchandise p ON s.product_id = p.product_id GROUP BY p.product_category"
        },
        {
          "id": "csql-9101",
          "title": "CustomerSegmentation_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Segmentation Analysis"
            }
          ],
          "question": "SELECT c.customer_id, c.location, COUNT(o.order_id) AS total_orders FROM ecommerce.sales_data.Clients c JOIN ecommerce.sales_data.Orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.location"
        },
        {
          "id": "csql-3141",
          "title": "CustomerOrders_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "question": "SELECT o.order_id, o.customer_id, o.order_date, o.sales_amount FROM ecommerce.sales_data.Orders o WHERE o.order_status = 'Accomplished'"
        },
        {
          "id": "csql-3142",
          "title": "CustomerProfiles_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "question": "SELECT c.customer_id, c.customer_name, c.section, c.location FROM ecommerce.sales_data.Clients c WHERE c.active_flag = 1"
        },
        {
          "id": "csql-3143",
          "title": "CustomerReturns_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "question": "SELECT r.return_id, r.order_id, r.return_reason FROM ecommerce.sales_data.Returns r"
        }
      ]
    }
  }
}

Identical to earlier than after we have been creating the grasp listing of information sources, right here additionally we have now nested json for the downstream information sources the place we would want to extract solely the “title” a part of it. Within the “question” column, the complete customized sql is dumped. If we use regex sample, we are able to simply seek for the names of the desk used within the question.

We all know that the desk names at all times come after FROM or a JOIN clause and so they usually observe the format ... The is elective and many of the occasions not used. There have been some queries I discovered which used this format and I ended up solely getting the database and schema names, and never the whole desk title. As soon as we have now extracted the names of the information sources and the names of the tables, we have to merge the rows per information supply as there may be a number of customized sql queries utilized in a single information supply.

### Convert the customized sql response into dataframe
col_names = custom_table_query_result['data']['customSQLTablesConnection']['nodes'][0].keys()
cs_df = pd.DataFrame(columns=col_names)

for i in custom_table_query_result['data']['customSQLTablesConnection']['nodes']:
    tmp_dt = {okay:v for okay,v in i.gadgets()}

    cs_df = pd.concat([cs_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the information supply title the place the customized sql question was used
cs_df['data_source'] = cs_df.downstreamDatasources.apply(lambda x: x[0]['name'] if x and 'title' in x[0] else None)
cs_df.reset_index(inplace=True)
cs_df.drop(['index','downstreamDatasources'], axis=1,inplace=True)

### We have to extract the desk names from the sql question. We all know the desk title comes after FROM or JOIN clause
# Notice that the title of desk may be of the format ..
# Relying on the format of how desk known as, you'll have to modify the regex expression

def extract_tables(sql):
    # Regex to match database.schema.desk or schema.desk, keep away from alias
    sample = r'(?:FROM|JOIN)s+((?:[w+]|w+).(?:[w+]|w+)(?:.(?:[w+]|w+))?)b'
    matches = re.findall(sample, sql, re.IGNORECASE)
    return listing(set(matches))  # Distinctive desk names

cs_df['customSQLTables'] = cs_df['query'].apply(extract_tables)
cs_df = cs_df[['data_source','customSQLTables']]

# We have to merge datasources as there may be a number of customized sqls utilized in the identical information supply
cs_df = cs_df.groupby('data_source', as_index=False).agg({
    'customSQLTables': lambda x: listing(set(merchandise for sublist in x for merchandise in sublist))  # Flatten & make distinctive
})

print('There are ', cs_df.form[0], 'datasources with customized sqls utilized in it')

After we carry out all of the above operations, that is how the construction of cs_df would seem like:

Processing of standard Tables in Information Sources

Now we have to get the listing of all of the common tables utilized in a datasource which aren’t part of customized SQL. There are two methods to go about it. Both use the publishedDatasources object and test for upstreamTables or use DatabaseTable and test for upstreamDatasources. I’ll go by the primary methodology as a result of I need the outcomes at an information supply stage (mainly, I need some code able to reuse after I need to test a selected information supply in additional element). Right here once more, for the sake of simplicity, as an alternative of going for pagination, I’m looping by way of every datasource to make sure I’ve all the things. We get the upstreamTables within the sector object in order that needs to be cleaned out.

############### Get the information sources with the common desk names utilized in your website

### Its greatest to extract the tables info for each information supply after which merge the outcomes.
# Since we solely get the desk info nested underneath fields, in case there are a whole bunch of fields 
# utilized in a single information supply, we are going to hit the response limits and will be unable to retrieve all the information.

data_source_list = master_df.title.tolist()

col_names = ['name', 'id', 'extractLastUpdateTime', 'fields']
ds_df = pd.DataFrame(columns=col_names)

with server.auth.sign_in(tableau_auth):
    for ds_name in data_source_list:
        question = """ {
            publishedDatasources (filter: { title: """"+ ds_name + """" }) {
            title
            id
            extractLastUpdateTime
            fields {
                title
                upstreamTables {
                    title
                }
            }
            }
        } """
        ds_name_result = server.metadata.question(
        question
        )
        for i in ds_name_result['data']['publishedDatasources']:
            tmp_dt = {okay:v for okay,v in i.gadgets() if okay != 'fields'}
            tmp_dt['fields'] = json.dumps(i['fields'])
        ds_df = pd.concat([ds_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

ds_df.reset_index(inplace=True)

That is how the construction of ds_df would look:

We are able to must flatten out the fields object and extract the sector names in addition to the desk names. For the reason that desk names might be repeating a number of occasions, we must deduplicate to maintain solely the distinctive ones.

# Perform to extract the values of fields and upstream tables in json lists
def extract_values(json_list, key):
    values = []
    for merchandise in json_list:
        values.append(merchandise[key])
    return values

ds_df["fields"] = ds_df["fields"].apply(ast.literal_eval)
ds_df['field_names'] = ds_df.apply(lambda x: extract_values(x['fields'],'title'), axis=1)
ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_values(x['fields'],'upstreamTables'), axis=1)

# Perform to extract the distinctive desk names 
def extract_upstreamTable_values(table_list):
    values = set()a
    for inner_list in table_list:
        for merchandise in inner_list:
            if 'title' in merchandise:
                values.add(merchandise['name'])
    return listing(values)

ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_upstreamTable_values(x['upstreamTables']), axis=1)
ds_df.drop(["index","fields"], axis=1, inplace=True)

As soon as we do the above operations, the ultimate construction of ds_df would look one thing like this:

We’ve all of the items and now we simply should merge them collectively:

###### Be part of all the information collectively
master_data = pd.merge(master_df, ds_df, how="left", on=["name","id"])
master_data = pd.merge(master_data, cs_df, how="left", left_on="title", right_on="data_source")

# Save the outcomes to analyse additional
master_data.to_excel("Tableau Information Sources with Tables.xlsx", index=False)

That is our closing master_data:

Desk-level Influence Evaluation

Let’s say there have been some schema adjustments on the “Gross sales” desk and also you need to know which information sources might be impacted. Then you may merely write a small perform which checks if a desk is current in both of the 2 columns — upstreamTables or customSQLTables like beneath.

def filter_rows_with_table(df, col1, col2, target_table):
    """
    Filters rows in df the place target_table is a part of any worth in both col1 or col2 (helps partial match).
    Returns full rows (all columns retained).
    """
    return df[
        df.apply(
            lambda row: 
                (isinstance(row[col1], listing) and any(target_table in merchandise for merchandise in row[col1])) or
                (isinstance(row[col2], listing) and any(target_table in merchandise for merchandise in row[col2])),
            axis=1
        )
    ]
# For instance 
filter_rows_with_table(master_data, 'upstreamTables', 'customSQLTables', 'Gross sales')

Under is the output. You may see that 3 information sources might be impacted by this modification. You may also alert the information supply homeowners Alice and Bob prematurely about this to allow them to begin engaged on a repair earlier than one thing breaks on the Tableau dashboards.

You may take a look at the whole model of the code in my Github repository here.

That is simply one of many potential use-cases of the Tableau Metadata API. You may also extract the sector names utilized in customized sql queries and add to the dataset to get a field-level influence evaluation. One can even monitor the stale information sources with the extractLastUpdateTime to see if these have any points or have to be archived if they don’t seem to be used any extra. We are able to additionally use the dashboards object to fetch info at a dashboard stage.

Closing Ideas

When you have come this far, kudos. This is only one use case of automating Tableau information administration. It’s time to replicate by yourself work and assume which of these different duties you might automate to make your life simpler. I hope this mini-project served as an pleasurable studying expertise to grasp the ability of Tableau Metadata API. For those who appreciated studying this, you may additionally like one other one among my weblog posts about Tableau, on among the challenges I confronted when coping with large .

Additionally do take a look at my earlier weblog the place I explored constructing an interactive, database-powered app with Python, Streamlit, and SQLite.

Earlier than you go…

Comply with me so that you don’t miss any new posts I write in future; one can find extra of my articles on my . You may also join with me on LinkedIn or Twitter!

Source link

Roleplay AI Chatbot Apps with the Best Memory: Tested

How to Perform Comprehensive Large Scale LLM Validation

What If I Had AI in 2020: Rent The Runway Dynamic Pricing Model

Roleplay AI Chatbot Apps with the Best Memory: Tested

I Tried Buying a Car Through Amazon: Here Are the Pros, Cons

Amazon and eBay to pay ‘fair share’ for e-waste recycling

Artificial Intelligence Concerns & Predictions For 2025

Barbara Corcoran: Entrepreneurs Must ‘Embrace Change’

Most Popular

The Costliest Startup Mistakes Are Made Before You Launch

Upgrade Your Workflow With This $40 AI Note-Taker

Leveraging Data to Predict Myocardial Infarction: A Machine Learning Approach | by Raonak Shukla | Jan, 2025

Our Picks

Roleplay AI Chatbot Apps with the Best Memory: Tested

Top Tools and Skills for AI/ML Engineers in 2025 | by Raviishankargarapti | Aug, 2025

PwC Reducing Entry-Level Hiring, Changing Processes

No More Tableau Downtime: Metadata API for Proactive Data Health

Connecting to the Tableau Metadata API

Processing of Customized SQL question tables

Processing of standard Tables in Information Sources

Desk-level Influence Evaluation

Closing Ideas

Related Posts