Power BI Data Loading - Detailed Guide
DATA LOADING - Detailed Explanation
Let me break down each step with practical details, examples, and best practices:
1. Identify all data sources needed for the project
What this means:
Before you start connecting to any data, you need to map out ALL the data sources your project requires. This is strategic planning before execution.
Why it matters:
- Prevents multiple trips back to stakeholders asking for more data
- Helps you understand data dependencies early
- Allows you to plan for different connection methods
- Identifies potential data quality issues before you start building
How to do it:
- Step 1: Understand the business requirements
- What questions need to be answered?
- What metrics need to be tracked?
- What time periods are needed?
- Who are the end users and what do they need to see?
- Step 2: List out all required data entities
Example for a Sales Report:
- Sales transactions (fact data)
- Products/inventory information
- Customer details
- Sales territories
- Dates (you’ll create this, but note it)
- Targets/budgets
- Employee/sales rep information
- Step 3: Map each entity to its source location
Example mapping:
Sales Transactions → SQL Server Database (Sales_DB → dbo.Transactions)
Products → Excel file (SharePoint → ProductMaster.xlsx)
Customers → CRM System (via API or export)
Territories → Excel file (Local → Territories.xlsx)
Targets → Excel file (SharePoint → Monthly_Targets.xlsx)
Employees → SQL Server (HR_DB → dbo.Employees)
- Step 4: Document access requirements
- Do you have access to each source?
- What credentials are needed?
- Are any sources behind firewalls?
- Do you need approval or special permissions?
Common Pitfall:
Starting to build before identifying all sources, then having to rebuild the model when new data sources are discovered.
2. Use appropriate connectors for each source
What this means:
Power BI has specific connectors optimized for different data sources. Using the right connector ensures better performance, query folding, and fewer errors.
Available Connector Types:
File-based connectors:
- Excel: For .xlsx, .xls files
- CSV/Text: For .csv, .txt files
- XML: For XML files
- JSON: For JSON files
- PDF: For extracting tables from PDFs
- Folder: For loading multiple files from a directory
Database connectors:
- SQL Server: For Microsoft SQL databases
- Oracle: For Oracle databases
- MySQL/PostgreSQL: For these database systems
- Azure SQL Database: For cloud-based SQL
- Access: For Microsoft Access databases
Cloud & Online Services:
- SharePoint Online List/Folder: For SharePoint data
- OneDrive: For files stored in OneDrive
- Dataverse: For Power Platform data
- Dynamics 365: For Dynamics data
- Salesforce: For CRM data
Web-based:
- Web: For web pages and APIs
- OData Feed: For OData services
- REST APIs: Using Web connector with authentication
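There is no dedicated generic REST connector; the usual pattern is the Web connector plus Json.Document in M. A minimal sketch, assuming a hypothetical endpoint that returns a JSON array of records:

let
    // hypothetical endpoint; swap in your API's URL and authentication header
    Source = Json.Document(
        Web.Contents("https://api.example.com/v2/sales",
            [Headers = [Accept = "application/json"]])),
    // works when the API returns a list of records
    AsTable = Table.FromRecords(Source)
in
    AsTable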
How to choose the right connector:
Example 1: Excel file stored on SharePoint
- Wrong approach: Download file locally and use Excel connector
- Right approach: Use SharePoint Folder connector
- Why: Direct connection means automatic updates, no manual downloads
Example 2: SQL Server data
- Wrong approach: Export to Excel then load Excel
- Right approach: Use SQL Server connector directly
- Why:
- Query folding (transformations happen in database, much faster)
- Always up-to-date
- Can handle much larger datasets
- Better performance
Example 3: Multiple Excel files in a folder with same structure
- Wrong approach: Load each file separately
- Right approach: Use Folder connector, then combine files
- Why: Automatic inclusion of new files, one transformation for all
How to access connectors:
Method 1: Home Ribbon
1. Open Power BI Desktop
2. Home tab → Get Data → More…
3. Search for your connector type
4. Click Connect

Method 2: Recent Sources
Home tab → Get Data → Recent Sources gives quick access to previously used sources.
Connector Best Practices:
Import vs DirectQuery:
- Import Mode (default): Data is loaded into Power BI's memory
  - Use when: dataset is under 1GB, data doesn't change frequently, need fast performance
- DirectQuery Mode: Queries go to the source each time a visual is refreshed
  - Use when: dataset is very large (multi-GB), need real-time data, data changes constantly
Connection Settings: When connecting, you’ll see options like:
SQL Server connection example:
Server: myserver.database.windows.net
Database: Sales_DB
Data Connectivity mode:
○ Import (recommended)
○ DirectQuery
Advanced options:
- SQL statement (optional)
- Command timeout
- Relationship columns

3. Load data to Power Query (not directly to model) for transformation
What this means:
When you connect to a data source, Power BI gives you two options:
1. Load: Brings data straight into the model
2. Transform Data: Opens Power Query Editor first
Always choose Transform Data (or at minimum, load to Power Query before loading to model).
Why this is critical:
Reason 1: Data is rarely clean. Raw data usually has:
- Wrong data types (dates stored as text)
- Extra rows (headers, footers, blank rows)
- Inconsistent formatting (spaces, mixed case)
- Unnecessary columns
- Null values that need handling
Reason 2: Performance optimization. Transforming in Power Query is more efficient than creating calculated columns in DAX later.
Reason 3: Documentation and maintenance. Power Query records every transformation step, making it easy to see what was done and to modify it later.
How to do it:
When first connecting:
1. Click "Get Data" → Select your source
2. Navigate to your file/table
3. You'll see a preview window
4. At the bottom, you have options:
- Load (DON'T click this)
- Transform Data (CLICK THIS)
- Load To... (for advanced scenarios)

If you accidentally clicked Load:
1. Right-click the query in the Queries pane (right side)
2. Select "Edit Query"
3. Power Query Editor will open

Understanding the Power Query Editor:
When you open Transform Data, you see:
- Left Pane: List of all queries
- Center: Data preview
- Right Pane: Query Settings
  - Properties (query name)
  - Applied Steps (all transformations)
Applied Steps Example:
Applied Steps:
1. Source (connected to data)
2. Navigation (selected specific sheet/table)
3. Promoted Headers (first row became column names)
4. Changed Type (auto-detected data types)
← These are automatic
Below this, you add your own steps:
5. Removed Columns
6. Filtered Rows
7. Replaced Values
... and so on
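Behind the scenes, those steps are a single M expression you can view via Home → Advanced Editor. A minimal sketch of what a generated query might look like (the file path, sheet name, and column names are illustrative):

let
    Source = Excel.Workbook(File.Contents("C:\Data\Sales.xlsx"), null, true),
    Navigation = Source{[Item = "Sales", Kind = "Sheet"]}[Data],
    // automatic steps
    #"Promoted Headers" = Table.PromoteHeaders(Navigation, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",
        {{"OrderDate", type date}, {"Amount", type number}}),
    // your own steps continue from here (remove columns, filter rows, ...)
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type", {"Notes"})
in
    #"Removed Columns"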
The Power Query Workflow:
Data Source → Power Query Editor → Apply Transformations → Close & Apply → Data Model

NOT this:
Data Source → Data Model (too late to transform efficiently)

4. Check data preview to understand structure and data types
What this means:
Before you start any transformations, you need to thoroughly examine your data to understand what you’re working with.
Why it matters:
- Helps you plan your transformations
- Identifies data quality issues early
- Prevents errors downstream
- Informs your data model design
What to look for:
1. Column Headers:
- Are headers in the first row, or do you need to promote them?
- Are header names clear and descriptive?
- Are there duplicate column names?
- Are there special characters or spaces in names?
Example issue:
Before promoting headers:
Column1 Column2 Column3
ProductID Product Price
12345 Widget A 50
After promoting headers:
ProductID Product Price
12345 Widget A 50

2. Data Types (CRITICAL): Look at the icon next to each column name:
- ABC = Text
- 123 = Whole Number
- 1.2 = Decimal Number
- 📅 = Date
- 🕐 = Date/Time
- ✓/✗ = True/False
- $ = Currency (if explicitly set)
Common data type issues:
- Dates stored as text (e.g., "01/15/2024" shows the ABC icon instead of the date icon)
- Numbers stored as text (often caused by currency symbols or thousands separators)
- ID columns detected as numbers (leading zeros get dropped; they should be Text)
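Fixing these is usually one explicit Changed Type step; auto-detection, which looks at only the first 200 rows, can misfire. A minimal sketch in M, using the ProductID/Product/Price columns from the example above:

ChangedType = Table.TransformColumnTypes(PromotedHeaders,
    {{"ProductID", type text},   // keep IDs as text so leading zeros survive
     {"Product", type text},
     {"Price", Currency.Type}})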
3. Data Quality Indicators:
Power BI shows three indicators for each column: - Column Quality: Valid, Error, Empty percentages - Column Distribution: Histogram showing value distribution - Column Profile: Detailed statistics
How to enable: - Go to View tab in Power Query - Check: Column Quality, Column Distribution, Column Profile
Example interpretation:
Product Name column:
Quality: 98% valid, 0% error, 2% empty
Distribution: Shows most products appear 50-100 times
Profile: 1,250 distinct products out of 50,000 rows
Action needed: Decide how to handle 2% empty values

4. Sample Data: Scroll through the preview to see:
- Are there blank rows that need removal?
- Are there subtotal/total rows mixed in?
- Is the data consistent, or are there format changes midway?
- Are there any obvious errors or outliers?
5. Row Count: Bottom left shows "Showing top 1000 rows"
- Power Query preview shows only the first 1000 rows by default
- Your transformations apply to ALL rows, not just the preview (see the profiling sketch after the checklist below)
Practical Checklist:
☐ Headers are in the correct row
☐ No blank columns
☐ No blank rows
☐ All data types are logical for each column
☐ Date columns are Date type (not text)
☐ Number columns are Number type (not text)
☐ ID columns are Text type (not number)
☐ No obvious errors or outliers
☐ Understand what nulls/blanks represent
☐ Know the grain of the data (what does each row represent?)
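Because the preview and quality bars are based on a sample, it can help to profile the full table from a throwaway query. A minimal sketch, assuming your query is named Sales:

// min, max, average, null count, and distinct count for every column,
// computed over all rows rather than the 1000-row preview
SalesProfile = Table.Profile(Sales)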
5. Apply folder/file filtering if loading from folders
What this means:
When you use the Folder connector to load multiple files, you often need to filter which files are actually loaded to avoid bringing in unwanted data.
When you use Folder connector:
Scenario 1: Multiple sales files
Folder structure:
📁 Sales Reports
├─ Sales_January_2024.xlsx
├─ Sales_February_2024.xlsx
├─ Sales_March_2024.xlsx
├─ Budget_2024.xlsx (don't want this)
└─ Archive_2023.xlsx (don't want this)

Scenario 2: Daily exports
📁 Daily Exports
├─ Export_2024-01-01.csv
├─ Export_2024-01-02.csv
├─ Export_2024-01-03.csv
... hundreds of files
└─ Export_2024-12-31.csv

How to connect to folder:
Step 1: Get Data → Folder
1. Click Get Data
2. Select "Folder"
3. Browse or paste folder path
4. Click OK

Step 2: You'll see a list of all files. The preview shows columns:
- Content (binary file)
- Name (filename)
- Extension (.xlsx, .csv, etc.)
- Date accessed
- Date modified
- Date created
- Attributes
- Folder Path
Step 3: Filter before combining
Filtering Techniques:
Method 1: Filter by extension
1. Click dropdown on "Extension" column
2. Uncheck extensions you don't want
3. Example: Keep only .xlsx, uncheck .pdf, .docx, etc.

Method 2: Filter by filename pattern
1. Click dropdown on "Name" column
2. Text Filters → Contains (or Begins With)
3. Example: Contains "Sales_" to keep only sales files
Method 3: Filter by date
1. Click dropdown on "Date modified" column
2. Date Filters → After
3. Select date (e.g., 01/01/2024 to get only current year files)
Useful for:
- Loading only recent files
- Excluding old versions
- Incremental loads

Method 4: Keep/Remove specific files
1. Click on Name column values
2. Right-click → Remove (to exclude specific files)
3. Or use "Keep Only Selected Rows"

After Filtering - Combine Files:
Step 4: Combine Files
1. After filtering, click "Combine Files" button at bottom
2. Power BI shows "Combine Files" dialog
3. Select the sheet/table to combine from
4. Click OK

Power BI will:
- Create a function to extract data from each file
- Apply same transformations to all files
- Combine all files into one table
- Add "Source.Name" column showing which file each row came from
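Under the hood this is roughly what happens. A minimal manual sketch for CSV files (the folder path is a placeholder; the column names follow the example result below):

let
    Source = Folder.Files("\\server\share\DailyExports\"),
    CsvOnly = Table.SelectRows(Source, each [Extension] = ".csv"),
    // parse each file and keep its name so every row stays traceable
    Parsed = Table.AddColumn(CsvOnly, "Data",
        each Table.PromoteHeaders(Csv.Document([Content]))),
    Kept = Table.SelectColumns(Parsed, {"Name", "Data"}),
    Expanded = Table.ExpandTableColumn(Kept, "Data",
        {"Order ID", "Product", "Amount"})
in
    Expanded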
Example Result:
Order ID | Product | Amount | Source.Name
1001 | Widget | 100 | Sales_January_2024.xlsx
1002 | Gadget | 200 | Sales_January_2024.xlsx
1003 | Widget | 150 | Sales_February_2024.xlsx
1004 | Tool | 300 | Sales_February_2024.xlsx

Common Folder Loading Patterns:
Pattern 1: Same structure monthly files
File naming: Sales_YYYY_MM.xlsx
Filter: Contains "Sales_"
Result: All sales files combined automatically
When a new month's file is added, it's automatically included in the next refresh

Pattern 2: Daily transaction files
File naming: Transactions_YYYY-MM-DD.csv
Filter: Extension = .csv AND Date modified > last year
Result: Only recent CSV files loaded
Old files ignored automatically

Pattern 3: Multiple product category files
Files: Electronics.xlsx, Clothing.xlsx, Furniture.xlsx, Template.xlsx
Filter: Does not contain "Template"
Result: All product files except template
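In M, all three patterns are just row filters applied to the file list before combining. A minimal sketch (the folder path and cutoff date are placeholders):

let
    Source = Folder.Files("\\sharepoint\sites\Sales\Reports\"),
    // Pattern 1: keep only the monthly sales workbooks
    SalesOnly = Table.SelectRows(Source, each Text.Contains([Name], "Sales_")),
    // Pattern 2: keep only recently modified files
    RecentOnly = Table.SelectRows(SalesOnly, each [Date modified] >= #datetime(2024, 1, 1, 0, 0, 0)),
    // Pattern 3: exclude template files
    NoTemplates = Table.SelectRows(RecentOnly, each not Text.Contains([Name], "Template"))
in
    NoTemplates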
Best Practices:
1. Filter early. Don't combine all files then filter later; filter the file list BEFORE combining to improve performance.
2. Use consistent file naming
Good naming convention:
Sales_2024_01.xlsx
Sales_2024_02.xlsx
Sales_2024_03.xlsx
Bad naming convention:
January Sales.xlsx
Feb2024sales.xlsx
March_SalesData.xlsx

3. Keep file structure consistent. All files should have:
- Same columns
- Same column names
- Same data types
- Headers in same position
4. Test with a sample. Before loading hundreds of files, test your filter logic with just a few files first.
5. Document your filters. Add a note explaining why certain files are excluded (for future reference).
6. Document source connections
What this means:
Keep a clear record of WHERE your data comes from and HOW to access it. This is essential for maintenance, troubleshooting, and handover to other team members.
Why it matters:
- When refresh fails: You know exactly where to check
- When moving between environments: (Dev to Production) you can easily update connections
- When someone else takes over: They understand the data sources
- For audit purposes: Trace where data originates
- For troubleshooting: Quickly identify which source is causing issues
What to document:
For Each Data Source, Record:
1. Source Type - SQL Server - Excel file - SharePoint list - API endpoint - Web URL - etc.
2. Connection Details
For SQL Server:
Server Name: sql-prod-server.database.windows.net
Database Name: SalesDB
Authentication Type: Windows Authentication / SQL Server / Azure AD
Service Account: domain\serviceaccount (if using service account)
Tables/Views Used:
- dbo.FactSales
- dbo.DimProduct
- dbo.DimCustomer

For Excel/CSV Files:
File Location: \\sharepoint\sites\Sales\Documents\
File Name: ProductMaster.xlsx
Sheet/Table Name: Products
Last Updated: 2024-01-15
Owner: John Doe (john.doe@company.com)
Update Frequency: Monthly (1st of each month)

For SharePoint:
Site URL: https://company.sharepoint.com/sites/Sales
Library: Shared Documents
Folder Path: /Reports/Monthly/
File Pattern: Sales_*.xlsx
Authentication: Organizational Account

For APIs:
API Endpoint: https://api.company.com/v2/sales
Authentication Method: API Key / OAuth2
API Key Location: Stored in Azure Key Vault
Rate Limits: 100 requests/hour
Documentation: https://api.company.com/docs

3. Access Credentials
Document credential TYPE (not the actual passwords):
SQL Server: Uses SQL Authentication
- Username: svc_powerbi (stored in gateway)
- Password: Managed by IT Security team
SharePoint: Uses OAuth
- Account: powerbi@company.com
- MFA: Enabled
API: Uses Bearer Token
- Token refresh: Every 24 hours
- Token source: Azure AD

4. Dependencies and Requirements
On-Premises Gateway Required: Yes/No
Gateway Name: Production-Gateway-01
Firewall Rules: IP whitelist required
VPN Connection: Required for SQL Server access
File Permissions: Read-only access needed

5. Data Refresh Schedule
Source Data Updates: Daily at 6:00 AM
Power BI Refresh: Daily at 7:00 AM
Lag Time: 1 hour
Historical Data: 2 years rolling

Where to document:
Option 1: In Power BI Desktop (Recommended)
In Power Query:
1. Right-click query
2. Properties
3. Add detailed description
Example description:
"Source: SQL Server (sql-prod.company.com\SalesDB)
Table: dbo.FactSales
Updated: Daily at 6 AM
Owner: Data Team (data@company.com)
Contains: Sales transactions from Jan 2022 onwards
Refresh: Scheduled daily at 7 AM via Production Gateway"

Option 2: External Documentation
Create a document (Word, Excel, OneNote, SharePoint) with a connection inventory:
| Query Name | Source Type | Connection String | Update Frequency | Owner | Notes |
|------------|-------------|-------------------|------------------|-------|-------|
| Sales | SQL Server | sql-prod.company | Daily | IT | Prod |
| Products | SharePoint | /sites/Sales/... | Monthly | Sales | ... |

Option 3: README in project folder
If storing the .pbix file in OneDrive/SharePoint, include a README.md:
# Sales Dashboard - Data Sources

## Sales Data
- **Source:** SQL Server
- **Server:** sql-prod.company.com
- **Database:** SalesDB
- **Table:** dbo.FactSales
- **Refresh:** Daily at 7 AM
## Product Master
- **Source:** Excel file
- **Location:** SharePoint /sites/Sales/ProductMaster.xlsx
- **Owner:** sales.team@company.com
- **Update:** Monthly, 1st of month

Documentation Template:
For each source, capture at minimum: source type, connection details, credential type (never the actual passwords), dependencies (gateway, VPN, firewall rules), refresh schedule, and the owner's contact.
Best Practices:
1. Update documentation when sources change. Don't let documentation become stale.
2. Include contact information. Always note who to contact for each data source.
3. Note known issues
"Note: SQL Server goes offline for maintenance every Sunday 2-4 AM"
"File may be locked if Sales team is editing during business hours"4. Document workarounds
"If SharePoint connection fails, use local copy at C:\Backup\"
"If API rate limit exceeded, increase interval between refreshes"5. Version control for connection strings If moving between Dev/Test/Prod, document all versions:
DEV: sql-dev.company.com
TEST: sql-test.company.com
PROD: sql-prod.company.com

7. Set up parameters for dynamic file paths or server connections
What this means:
Instead of hardcoding connection details (like server names, file paths, database names) directly into your queries, you store them in parameters that can be easily changed without editing every query.
Why this is crucial:
Problem without parameters:
You build a report connecting to:
Server: sql-dev-server.database.windows.net
Database: SalesDB_Dev
When moving to production, you need to:
1. Find every query connected to the dev server
2. Edit each connection manually
3. Change to: sql-prod-server.database.windows.net
4. Change database to: SalesDB_Prod
5. Hope you didn't miss any!

Solution with parameters:
You create parameters:
ServerName = sql-dev-server.database.windows.net
DatabaseName = SalesDB_Dev
All queries reference these parameters
To move to production:
1. Change ServerName parameter to sql-prod-server
2. Change DatabaseName parameter to SalesDB_Prod
3. Done! All queries automatically use the new values

When to use parameters:
- Server names (Dev vs Production)
- Database names
- File paths (local vs SharePoint vs network drive)
- Date ranges (StartDate, EndDate for filtered queries)
- API endpoints (different environments)
- Folder locations (when using Folder connector)
How to create parameters:
Method 1: Create Parameter Manually
Step 1: Open Power Query Editor (Transform Data)
Step 2: Home tab → Manage Parameters → New Parameter
Step 3: Fill in parameter details

Parameter Configuration:
Name: ServerName
Description: SQL Server name for database connection
Required: Yes (must have a value)
Type: Text
Suggested Values: Any value
Current Value: sql-dev-server.database.windows.net
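Under the hood, a parameter is an ordinary query whose value carries metadata that Power BI reads. A minimal sketch of what the Advanced Editor shows for the ServerName parameter above:

"sql-dev-server.database.windows.net" meta [
    IsParameterQuery = true,
    Type = "Text",
    IsParameterQueryRequired = true
]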
Common Parameter Types:
1. Server/Connection Parameters:
Parameter: ServerName
Type: Text
Current Value: sql-prod-server.database.windows.net
Parameter: DatabaseName
Type: Text
Current Value: SalesDB
Parameter: Port
Type: Number
Current Value: 1433

2. File Path Parameters:
Parameter: FolderPath
Type: Text
Current Value: \\sharepoint\sites\Sales\Documents\
Parameter: FileName
Type: Text
Current Value: ProductMaster.xlsx

3. Date Range Parameters:
Parameter: StartDate
Type: Date
Current Value: 2024-01-01
Parameter: EndDate
Type: Date
Current Value: 2024-12-31

4. Environment Parameters:
Parameter: Environment
Type: Text
Suggested Values: List of values
- Development
- Testing
- Production
Current Value: Production

How to use parameters in queries:
Example 1: SQL Server Connection
Without Parameters (Hardcoded):
Source = Sql.Database("sql-prod-server.database.windows.net", "SalesDB")With Parameters:
Source = Sql.Database(ServerName, DatabaseName)

Step-by-step to replace with parameters:
1. In Power Query, find the Source step
2. Click the gear icon next to "Source" in Applied Steps
3. This opens the connection dialog
4. Instead of typing server name, click "fx" button
5. Select parameter from dropdown
6. Repeat for database name
7. Click OK

Example 2: File Path
Without Parameters:
Source = Excel.Workbook(File.Contents("C:\Users\YourName\Documents\Sales.xlsx"))With Parameters:
Source = Excel.Workbook(File.Contents(FolderPath & FileName))

Here, & concatenates the folder path and file name.
Example 3: Folder Connector
Without Parameters:
Source = Folder.Files("\\sharepoint\sites\Sales\Reports\2024\")With Parameters:
Source = Folder.Files(FolderPath)

Example 4: Date Filtering in SQL Query
Without Parameters:
Source = Sql.Database("server", "db", [Query="
SELECT *
FROM Sales
WHERE OrderDate >= '2024-01-01'
AND OrderDate <= '2024-12-31'
"])With Parameters:
Source = Sql.Database(ServerName, DatabaseName, [Query="
SELECT *
FROM Sales
WHERE OrderDate >= '" & Date.ToText(StartDate, "yyyy-MM-dd") & "'
AND OrderDate <= '" & Date.ToText(EndDate, "yyyy-MM-dd") & "'
"])Advanced Parameter Scenarios:
Scenario 1: Environment-Based Connection
Create a parameter that drives multiple connection strings:
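A minimal sketch in M, reusing the Dev/Test/Prod server names from this guide (adjust to your own):

// derive the server name from the single Environment parameter
ServerName =
    if Environment = "Production" then "sql-prod-server.database.windows.net"
    else if Environment = "Testing" then "sql-test-server.database.windows.net"
    else "sql-dev-server.database.windows.net",
Source = Sql.Database(ServerName, DatabaseName)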
Now you only change ONE parameter (Environment) and all connections update.
Scenario 2: Dynamic Date Parameters
Instead of manually setting dates, make them relative:
Parameter: MonthsBack
Type: Number
Current Value: 6
Then in query:
StartDate = Date.AddMonths(DateTime.LocalNow(), -MonthsBack)
EndDate = DateTime.LocalNow()

This automatically gives you the last 6 months without manual updates.
Scenario 3: Multiple File Patterns
Parameter: FilePattern
Type: Text
Current Value: Sales_
Then in Folder query:
Source = Folder.Files(FolderPath),
FilteredRows = Table.SelectRows(Source, each Text.Contains([Name], FilePattern))

Change FilePattern to load different file sets.
Parameter Best Practices:
1. Naming Convention
Good:
- ServerName (clear what it is)
- FolderPath (descriptive)
- StartDate (obvious purpose)
Bad:
- Param1 (not descriptive)
- Server (too generic)
- S (unclear)

2. Add Descriptions. Always fill in the Description field:
Name: ServerName
Description: SQL Server hostname for Sales database connection.
Use sql-dev for development, sql-prod for production.

3. Use Suggested Values for limited options
Parameter: Environment
Suggested Values: List of values
- Development
- Testing
- Production
This creates a dropdown instead of free text entry.

4. Document parameter dependencies
In query description:
"This query uses parameters: ServerName, DatabaseName, StartDate
Change parameters before moving to production environment"

5. Group parameters logically. Create a naming structure:
Environment parameters:
- ENV_ServerName
- ENV_DatabaseName
Date parameters:
- DATE_StartDate
- DATE_EndDate
File parameters:
- FILE_FolderPath
- FILE_Pattern

How Parameters Help in Deployment:
Development Environment:
ServerName = sql-dev-server.database.windows.net
DatabaseName = SalesDB_Dev
FolderPath = C:\Dev\Data\

Test Environment:
ServerName = sql-test-server.database.windows.net
DatabaseName = SalesDB_Test
FolderPath = \\test-share\Data\

Production Environment:
ServerName = sql-prod-server.database.windows.net
DatabaseName = SalesDB
FolderPath = \\sharepoint\sites\Production\Data\

Simply change the parameter values, refresh, and you're in a different environment - no query editing required!
Parameters in Power BI Service:
After publishing to Power BI Service:
1. Go to workspace
2. Find your dataset
3. Settings → Parameters
4. Change parameter values
5. Apply
6. Refresh dataset

This lets you change connections without re-publishing the report!
8. Avoid loading unnecessary columns at this stage
What this means:
When connecting to a data source, only bring in the columns you actually need for your analysis. Leave behind anything that won’t be used in your report.
Why this is critical:
Performance Impact:
Scenario: SQL table with 50 columns, 10 million rows
Loading all 50 columns:
- Model size: 800 MB
- Refresh time: 15 minutes
- Query performance: Slow
Loading only 10 needed columns:
- Model size: 200 MB
- Refresh time: 4 minutes
- Query performance: Fast

Memory Impact: In Import mode, Power BI loads the entire dataset into RAM. Unnecessary columns waste memory that could be used for actual analysis.
Maintainability: Fewer columns = cleaner model = easier to understand and maintain.
When to filter columns:
Best Time: During initial connection (BEFORE Power Query transformations)
When you connect to a source, you often see a “Navigator” or preview screen. This is your chance to select only needed columns.
How to select specific columns:
Method 1: During Initial Connection (SQL Server Example)
1. Get Data → SQL Server → pick your table in the Navigator
2. Click Transform Data (not Load)
3. Make "Choose Columns" your very first step, so only the needed columns are loaded
Method 2: Using SQL Query (Advanced)
Instead of loading entire table, write a SELECT statement:
Instead of loading table "Sales" with all columns:
Write custom SQL:
SELECT
OrderID,
OrderDate,
CustomerID,
ProductID,
Quantity,
UnitPrice,
TotalAmount
FROM dbo.Sales
WHERE OrderDate >= '2022-01-01'

This approach:
- Only loads specified columns
- Can also filter rows in the database (faster)
- Takes advantage of query folding
How to use SQL query:
When connecting to SQL Server:
1. Advanced options → SQL statement
2. Enter your SELECT query
3. Click OK

Method 3: In Power Query Editor
If columns are already loaded:
1. Home tab → Choose Columns
2. Untick the columns you don't need, then click OK
3. Or select the unwanted columns and use Remove Columns
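Either action becomes a single step in M. A minimal sketch (column names follow the Sales examples below):

// Choose Columns: keep only the listed columns
ChosenColumns = Table.SelectColumns(Source,
    {"OrderID", "OrderDate", "CustomerID", "ProductID", "Quantity", "UnitPrice"}),
// Remove Columns: drop specific unwanted columns instead
RemovedColumns = Table.RemoveColumns(Source, {"CreatedBy", "ModifiedBy", "RowVersion"})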
Which columns to remove:
Definitely Remove:
1. Internal system columns
- CreatedBy (system user who created record)
- ModifiedBy (system user who updated record)
- CreatedTimestamp (not needed if you have OrderDate)
- ModifiedTimestamp (rarely needed in reporting)
- RowVersion (for database concurrency)
- RecordID (internal tracking)

2. Redundant columns
If you have FirstName and LastName, you don't need:
- FullName (can create this in Power Query if needed)
If you have Quantity and UnitPrice, you don't need:
- TotalAmount (calculate with DAX measure instead)

3. Overly detailed columns you won't use
From Customer table:
- Remove: BillingAddress2, BillingAddress3, MiddleName, Suffix
- Keep: CustomerID, CustomerName, City, State, Country

4. Binary/blob columns
- Photos, images (unless specifically needed)
- Attachments
- Large text fields (notes, comments) if not used

5. Denormalized columns (that will come from dimension tables)
In Sales fact table:
- Remove: ProductName, ProductCategory, ProductColor
- Keep: ProductID only
- Why: You'll get product details from the Product dimension table

Columns to ALWAYS keep:
1. Keys for relationships
In Sales table:
- Keep: OrderID, CustomerID, ProductID, DateKey, StoreID
- These create relationships to dimension tables

2. Metrics/measures (numbers to analyze)
- Quantity
- UnitPrice
- DiscountAmount
- TaxAmount
- TotalAmount (unless it can be derived from Quantity and UnitPrice, as noted above)

3. Date/time fields
- OrderDate
- ShipDate
- DeliveryDate

4. Status/flag fields if used in filtering
- OrderStatus (Pending, Shipped, Delivered)
- IsCancelled
- PaymentStatus

Example Comparison:
Bad: Loading Everything
Sales Table (loaded all 50 columns):
The 8 columns below plus audit columns (CreatedBy, ModifiedBy, RowVersion),
redundant descriptive text from dimension tables, and dozens of other unused fields
Result:
- Much larger model and slower refresh (see the 50-column scenario above)
- Cluttered field list
Good: Loading Only What’s Needed
Sales Table (loaded 8 columns):
OrderID, OrderDate, CustomerID, ProductID, Quantity, UnitPrice,
OrderStatus, TotalAmount
Result:
- Model size: 120 MB
- Refresh: 3 minutes
- Clean field list
- Other details come from Customer and Product dimension tables

Special Case: Choose Columns vs Remove Columns
Use "Choose Columns" when:
- You know exactly which columns you need
- You want to keep only 5-10 columns from a table with 50 columns
- Starting fresh with a clear list

Use "Remove Columns" when:
- You want to keep most columns and remove just a few
- You're removing 3-5 specific columns from a 20-column table

Use "Remove Other Columns" when:
- You want to keep specific columns and don't care what else exists
- Working with unfamiliar tables where you've found the columns you need
Column Selection Patterns by Table Type:
Fact Tables (Transaction Tables):
Keep:
✓ All foreign keys (IDs for relationships)
✓ Date columns
✓ Numeric measures (quantities, amounts)
✓ Status flags (if used for filtering)
Remove:
✗ Descriptive text from related tables
✗ Calculated fields that can be DAX measures
✗ Audit columns (CreatedBy, ModifiedDate)
✗ Internal system columns

Dimension Tables (Lookup Tables):
Keep:
✓ Primary key (ID)
✓ Display name/description
✓ Category/group fields
✓ Sort order columns (if needed)
✓ Active/inactive status
Remove:
✗ Detailed descriptions (unless used in report)
✗ Internal codes (unless needed)
✗ Audit columns
✗ Redundant text columns

Query Folding Consideration:
Important: If you remove columns in Power Query, check if query folding still works.
Query folding = Power Query pushes transformations back to the data source (database does the work, not Power BI).
Check query folding:
1. Right-click on last step in Applied Steps
2. If "View Native Query" is available → Query folding works ✓
3. If greyed out → Query folding broken ✗Best practice: Remove columns using “Choose Columns” or “Remove Columns” maintains query folding. More complex transformations might break it.
The Selection Process:
Before loading data, ask yourself for each column:
1. Will this column be used in any visual? NO → Remove
2. Is this column needed for a relationship? NO → Remove
3. Is this column needed for filtering? NO → Remove
4. Will this column be used in any calculation? NO → Remove
5. Is this column for context/tooltip? NO → Remove
If the answer is NO to all → Remove the column!

Performance Testing:
After removing columns, check the improvement:
Before:
- Query: [Sales - All Columns]
- Refresh time: 12 minutes
- File size: 450 MB
After removing unnecessary columns:
- Query: [Sales - Optimized]
- Refresh time: 4 minutes
- File size: 180 MB
Improvement: 67% faster, 60% smaller

Summary: The Data Loading Checklist
Before you click "Close & Apply" in Power Query, verify:
☐ All required data sources identified (step 1)
☐ The appropriate connector used for each source (step 2)
☐ Data loaded through Power Query, not straight to the model (step 3)
☐ Data preview checked: structure, data types, quality indicators (step 4)
☐ Folder/file filters applied where loading from folders (step 5)
☐ Source connections documented (step 6)
☐ Parameters set up for server names and file paths (step 7)
☐ Unnecessary columns removed (step 8)
This detailed breakdown should give you a complete understanding of each data loading step. Each principle here helps you build a more efficient, maintainable, and performant Power BI solution from the very beginning.