Performance: References Versus Duplicating

Performance: References Versus Duplicating

https://greyskullanalytics.com/power-query-duplicate-v-reference-queries/

Study Notes: Duplicate vs. Reference Queries in Power Query

Key Concepts

  1. Duplicate Queries:
    • Creates a full copy of the query, including all applied steps.
    • Changes made to the duplicate query do not affect the original query (and vice versa).
    • Independent queries.
  2. Reference Queries:
    • Does not copy steps; instead, it references the end result of the original query.
    • Changes to the original query cascade to the reference query.
    • The two queries remain linked.

Query Efficiency Misconception

  • Common Misunderstanding:
    • Reference queries are believed to reduce source data calls or utilize cached results from the original query.
  • Reality:
    • For relational databases, this is not true. Both queries execute fully, and no automatic caching occurs during data loading.
    • Reference queries work by referencing the code of the original query, not its data.

Use Cases

  1. When to Use Duplicate Queries:
    • Testing changes or transformations without risking the original query.
    • Useful in the Advanced Editor to avoid accidental mistakes while experimenting.
    • Keeps a copy of the original query steps as a backup.
  2. When to Use Reference Queries:
    • Dimensional Modeling:
      • Derive multiple related tables (fact and dimension) from a single source table.
      • Update the source in one place (the original query), and all reference queries reflect the change.
    • Data Lineage Setup:
      • Organize queries for clarity and maintainability using a folder structure:
        • Extract Folder: Load raw source data (no transformations).
        • Transform Folder: Reference and transform data (e.g., merge, append, aggregate).
        • Load Folder: Reference the transformed data to create the final output (cleaned, renamed, etc.).
      • Disable "Enable Load" for intermediate queries in Extract and Transform folders to improve performance.

Folder Structure Example

  1. Extract: Raw data (untouched).
    • E.g., Source tables: ProductsOrdersCustomers.
  2. Transform: Reference and shape data.
    • Perform actions such as merges and aggregations.
    • E.g., Combine Products and Categories into a denormalized Product Dimension table.
  3. Load: Final query for reporting or dashboards.
    • Rename columns, remove unused fields, or create summaries.
    • Enable only this layer for loading.

Caching Notes

  • Development Stage:
    • Caching occurs during query preview in Power Query for faster editing.
  • Load Stage:
    • No caching occurs during dataset loads unless explicitly designed (e.g., using APIs).

Tips and Best Practices

  • Duplicate Queries: Ideal for testing and creating backups.
  • Reference Queries: Follow DRY Principle (Don’t Repeat Yourself) by reusing queries to simplify updates and maintain consistency.
  • Use clear folder naming conventions for query organization:
    • Traditional: Extract → Transform → Load
    • Modern (Lakehouse): Bronze → Silver → Gold or Raw → Base → Curated.
  • Enable Load: Only for queries required in the final dataset to improve efficiency.

Practical Application

  • Testing Changes: Use duplicates for safe trial-and-error in query adjustments.
  • Building Fact and Dimension Tables: Use references to create a clear lineage of transformations.
  • Organizing Workflows: Set up Extract/Transform/Load (ETL) pipelines for structured and scalable data transformations.

Conclusion

  • Reference queries do not inherently reduce data processing but provide a way to reuse and link query logic.
  • Duplicate queries ensure safe experimentation and backup of query steps.
  • Understanding these distinctions is crucial for efficient Power Query workflows.

Your Practice Task:

  1. Create a folder structure (Extract, Transform, Load) for a sample dataset.
  2. Experiment with both duplicate and reference queries to observe their behavior.
  3. Try enabling/disabling the "Enable Load" option to see its impact on performance.

4o

https://greyskullanalytics.com/power-query-duplicate-v-reference-queries/