Performance: References Versus Duplicating

https://greyskullanalytics.com/power-query-duplicate-v-reference-queries/

Study Notes: Duplicate vs. Reference Queries in Power Query

Key Concepts

Duplicate Queries:

Creates a full copy of the query, including all applied steps.
Changes made to the duplicate query do not affect the original query (and vice versa).
Independent queries.

Reference Queries:

Does not copy steps; instead, it references the end result of the original query.
Changes to the original query cascade to the reference query.
The two queries remain linked.

Query Efficiency Misconception

Common Misunderstanding:

Reference queries are believed to reduce source data calls or utilize cached results from the original query.

Reality:

For relational databases, this is not true. Both queries execute fully, and no automatic caching occurs during data loading.
Reference queries work by referencing the code of the original query, not its data.

Use Cases

When to Use Duplicate Queries:

Testing changes or transformations without risking the original query.
Useful in the Advanced Editor to avoid accidental mistakes while experimenting.
Keeps a copy of the original query steps as a backup.

When to Use Reference Queries:

Dimensional Modeling:

Derive multiple related tables (fact and dimension) from a single source table.
Update the source in one place (the original query), and all reference queries reflect the change.

Data Lineage Setup:

Organize queries for clarity and maintainability using a folder structure:

Extract Folder: Load raw source data (no transformations).
Transform Folder: Reference and transform data (e.g., merge, append, aggregate).
Load Folder: Reference the transformed data to create the final output (cleaned, renamed, etc.).

Disable "Enable Load" for intermediate queries in Extract and Transform folders to improve performance.

Folder Structure Example

Extract: Raw data (untouched).

E.g., Source tables: Products, Orders, Customers.

Transform: Reference and shape data.

Perform actions such as merges and aggregations.
E.g., Combine Products and Categories into a denormalized Product Dimension table.

Load: Final query for reporting or dashboards.

Rename columns, remove unused fields, or create summaries.
Enable only this layer for loading.

Caching Notes

Development Stage:

Caching occurs during query preview in Power Query for faster editing.

Load Stage:

No caching occurs during dataset loads unless explicitly designed (e.g., using APIs).

Tips and Best Practices

Duplicate Queries: Ideal for testing and creating backups.
Reference Queries: Follow DRY Principle (Don’t Repeat Yourself) by reusing queries to simplify updates and maintain consistency.
Use clear folder naming conventions for query organization:

Traditional: Extract → Transform → Load
Modern (Lakehouse): Bronze → Silver → Gold or Raw → Base → Curated.

Enable Load: Only for queries required in the final dataset to improve efficiency.

Practical Application

Testing Changes: Use duplicates for safe trial-and-error in query adjustments.
Building Fact and Dimension Tables: Use references to create a clear lineage of transformations.
Organizing Workflows: Set up Extract/Transform/Load (ETL) pipelines for structured and scalable data transformations.

Conclusion

Reference queries do not inherently reduce data processing but provide a way to reuse and link query logic.
Duplicate queries ensure safe experimentation and backup of query steps.
Understanding these distinctions is crucial for efficient Power Query workflows.

Your Practice Task:

Create a folder structure (Extract, Transform, Load) for a sample dataset.
Experiment with both duplicate and reference queries to observe their behavior.
Try enabling/disabling the "Enable Load" option to see its impact on performance.

4o

https://greyskullanalytics.com/power-query-duplicate-v-reference-queries/