1. Introduction
This study note explains how to combine multiple Excel and CSV files into a single dataset using pandas, with a focus on speed and efficiency. It also covers the best engines for reading and writing files, and how to save the combined result in different formats for future use. Combining data from different files into one dataset is a common task in data analytics. Efficiently handling such operations helps improve workflow speed, especially when dealing with large datasets or when preparing data for exploratory analysis and model training.
2. Choosing the Right Engine
When working with multiple file types such as .xlsx and .csv, pandas offers different engines to handle file reading and writing operations. The choice of engine directly affects speed, memory usage, and compatibility.
Below is a comparison of the commonly used engines.
| Engine | File Type | Speed | Description |
| --- | --- | --- | --- |
| pyarrow | CSV | Very Fast | Apache Arrow’s vectorized, multithreaded parser, selected with engine="pyarrow" in pd.read_csv (pandas 1.4+). Arrow-backed dtypes can also be requested from other readers with dtype_backend="pyarrow" (pandas 2.0+). |
| openpyxl | Excel (.xlsx) | Medium | Default engine for Excel files. Reliable and widely supported, but slower than Arrow-based CSV reading. |
| xlsxwriter | Excel (.xlsx), writing only | Medium | Used only for writing Excel files. Produces lightweight, compatible output files. |
| odf | ODS files | Slow | XML-based engine for OpenDocument spreadsheets. Generally slower than Excel formats. |
| Built-in CSV engine | CSV files | Fast | pandas’ default optimized C parser. Handles large CSVs efficiently, though the pyarrow engine is usually faster. |
Conclusion:
The pyarrow engine is the fastest option for reading CSV files and integrates well with pandas’ Arrow-backed dtypes. Excel files are read through openpyxl; combining it with dtype_backend="pyarrow" keeps the resulting DataFrame memory-efficient.
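As a quick illustration of how the engine is selected, the snippet below reads a placeholder sample.csv with the Arrow-based parser and a placeholder sample.xlsx through openpyxl with Arrow-backed dtypes (the file names are only examples):
import pandas as pd

# CSV: the multithreaded Arrow parser (pandas 1.4+)
df_csv = pd.read_csv("sample.csv", engine="pyarrow")

# Excel: openpyxl parses the file; dtype_backend="pyarrow" (pandas 2.0+)
# stores the result in Arrow-backed columns for lower memory use
df_xlsx = pd.read_excel("sample.xlsx", engine="openpyxl", dtype_backend="pyarrow")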
3. Installation of Required Packages
Before working with these engines, install the necessary libraries.
pip install pyarrow openpyxl xlsxwriter
- pyarrow enables the fast Arrow-based CSV parser and Parquet support.
- openpyxl reads and writes .xlsx files.
- xlsxwriter is used later for writing the combined .xlsx output.
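As an optional sanity check (assuming the installation succeeded), printing the installed versions confirms that the Arrow-backed features used below are available; pandas 2.0+ is needed for dtype_backend="pyarrow":
import pandas as pd
import pyarrow
import openpyxl

print(pd.__version__, pyarrow.__version__, openpyxl.__version__)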
4. Combining All Excel and CSV Files into a Single Sheet
This section demonstrates how to merge all Excel and CSV files located in the same folder into one combined dataset using pandas.
Step 1: Import Required Libraries
import pandas as pd
import glob
import os
These libraries handle:
- pandas: data manipulation and combination.
- glob: locating multiple files by pattern (e.g., all .csv or .xlsx files).
- os: file path operations.
Step 2: Collect All Excel and CSV Files
files = glob.glob("*.xlsx") + glob.glob("*.csv")
This line searches the current directory and lists all files ending with .xlsx or .csv. The resulting list, files, contains the paths of all identified files.
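If the source files live in a dedicated folder rather than the current directory, the same pattern works with a path prefix; the folder name below is only an example:
folder = "reports"  # hypothetical folder containing the source files
files = glob.glob(os.path.join(folder, "*.xlsx")) + glob.glob(os.path.join(folder, "*.csv"))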
Step 3: Read Each File and Store in a List
combined_data = []

for file in files:
    ext = os.path.splitext(file)[1].lower()
    if ext == ".csv":
        df = pd.read_csv(file, engine="pyarrow")     # Fast Arrow-based CSV reader
    elif ext == ".xlsx":
        df = pd.read_excel(file, engine="openpyxl")  # Excel files are parsed by openpyxl
    else:
        continue

    # Add filename for traceability
    df["source_file"] = os.path.basename(file)
    combined_data.append(df)
Explanation:
- The code loops through each file, checking its extension.
- CSV files are read with pd.read_csv() and the pyarrow engine; Excel files are read with pd.read_excel() and the openpyxl engine.
- A new column source_file is added to record the file name, which is helpful for tracing data origins.
- Each dataframe is appended to the combined_data list.
- A more defensive variant of the loop, which skips unreadable files, is sketched below.
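The sketch below is not part of the original code: it wraps each read in a try/except so that a corrupted or unreadable file is reported and skipped instead of stopping the whole run.
combined_data = []

for file in files:
    ext = os.path.splitext(file)[1].lower()
    try:
        if ext == ".csv":
            df = pd.read_csv(file, engine="pyarrow")
        elif ext == ".xlsx":
            df = pd.read_excel(file, engine="openpyxl")
        else:
            continue
    except (ValueError, OSError) as exc:
        # Report and skip unreadable or corrupted files
        print(f"Skipping {file}: {exc}")
        continue
    df["source_file"] = os.path.basename(file)
    combined_data.append(df)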
Step 4: Combine All Dataframes
final_df = pd.concat(combined_data, ignore_index=True)
This concatenates all dataframes in combined_data into a single dataframe named final_df.
The parameter ignore_index=True resets the row indices to create a continuous index across all merged data.
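To see what ignore_index=True changes, a tiny self-contained example (toy data, unrelated to the files above) compares the two behaviours:
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0, 1] - original indices repeat
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2, 3] - one continuous index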
Step 5: Save Combined Data to Excel
final_df.to_excel("all_combined.xlsx", sheet_name="Merged_Data", index=False, engine="xlsxwriter")
Explanation:
- The final merged dataset is written to a single Excel file named all_combined.xlsx.
- The parameter sheet_name="Merged_Data" specifies the worksheet name.
- The xlsxwriter engine provides fast, reliable writing (it requires the xlsxwriter package installed earlier).
- index=False prevents pandas from writing row indices to the file.
Result:
- All .csv and .xlsx files are merged into one Excel sheet.
- A new column, source_file, identifies the origin of each record.
- The process executes quickly, with CSV reading accelerated by the pyarrow engine.
5. Converting Combined Data to Parquet Format for Faster Access
While Excel files are human-readable, they are not the most efficient for large-scale analytical work.
For better performance, the combined dataset can be stored in Parquet format, which is optimized for both speed and compression.
Step 1: Save Data as Parquet
final_df.to_parquet("all_combined.parquet", engine="pyarrow")
This saves the dataframe in a binary, columnar format using Apache Arrow.
Parquet format reduces file size and allows faster read/write operations compared to CSV or Excel.
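As a rough, data-dependent check of the compression claim, the sizes of the two output files can be compared directly (the exact numbers will vary with your data):
import os

for path in ["all_combined.xlsx", "all_combined.parquet"]:
    print(path, os.path.getsize(path) // 1024, "KB")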
Step 2: Load Parquet File for Future Use
df = pd.read_parquet("all_combined.parquet", engine="pyarrow")
This loads the saved Parquet file back into pandas almost instantly.
It’s the recommended method when repeatedly accessing large datasets during analysis or model building.
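Because Parquet is a columnar format, read_parquet can also load just the columns needed for a given analysis, which is often much faster than reading the full file. For example, loading only the source_file column created earlier:
subset = pd.read_parquet("all_combined.parquet", engine="pyarrow", columns=["source_file"])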
6. Summary
| Task | Recommended Engine | Reason |
| --- | --- | --- |
| Read Excel files | openpyxl (optionally with dtype_backend="pyarrow") | Default, reliable Excel reader; Arrow-backed dtypes keep memory use low |
| Write Excel files | xlsxwriter | Reliable and lightweight for writing |
| Read CSV files | pyarrow | Uses the vectorized Arrow backend |
| Save for future use | pyarrow (Parquet) | Fastest and most space-efficient storage |
7. Key Notes
- The pyarrow engine provides the fastest reading performance for CSV files; Excel reading goes through openpyxl.
- The xlsxwriter engine is used for writing Excel files but cannot read them.
- Combining datasets into a single sheet is useful for quick access and unified analysis during pandas practice.
- For continuous use and faster loading, converting to Parquet format is recommended.
- Parquet files are particularly suitable for analytical workloads, machine learning pipelines, and cloud storage.
8. Practical Use Case Example
This workflow is ideal for practice scenarios where:
- Multiple departmental reports exist as separate CSV and Excel files.
- The analyst wants to unify them into a single dataset for aggregation, transformation, and exploratory data analysis in pandas.
- After processing, the result can be saved as either an Excel file for reporting or a Parquet file for performance optimization.
9. Conclusion
Efficiently combining Excel and CSV files into one dataset allows faster experimentation and cleaner data management.
Using the pyarrow engine for CSV reading ensures optimal speed, while xlsxwriter provides flexibility for exporting results.
For long-term or repeated use, Parquet offers the best balance between performance and storage efficiency.
This setup serves as a foundational workflow for advanced pandas operations and future data engineering tasks.
How to Add Another File to an Existing Parquet Dataset
1. Read the Existing Parquet File
Start by loading your current Parquet dataset into memory.
import pandas as pd
df = pd.read_parquet("all_combined.parquet", engine="pyarrow")
This retrieves the existing data into a pandas DataFrame called df.
2. Read the New File You Want to Add
You can add a new file of any format — CSV or Excel.
Simply read it using the same Arrow engine for consistency.
# Example: new CSV file
new_data = pd.read_csv("new_file.csv", engine="pyarrow")
# OR, if it’s an Excel file
# new_data = pd.read_excel("new_file.xlsx", engine="openpyxl")
Optionally, you can add a column identifying the file name for traceability:
new_data["source_file"] = "new_file.csv"
3. Combine the New Data with the Existing Dataset
Use pd.concat() to append the new rows.
updated_df = pd.concat([df, new_data], ignore_index=True)
- ignore_index=True resets the index after concatenation.
- Both DataFrames should have matching column names; if not, missing columns will be filled with NaN.
4. Save the Updated Data Back to Parquet
Finally, overwrite or version your Parquet file with the new combined data.
updated_df.to_parquet("all_combined.parquet", engine="pyarrow")
✅ This overwrites the existing file with the updated version.
If you prefer version control (keeping backups), save with a new name:
updated_df.to_parquet("all_combined_v2.parquet", engine="pyarrow")
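One simple versioning convention (a sketch; any naming scheme works) is to stamp the output file name with the current date:
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d")
updated_df.to_parquet(f"all_combined_{stamp}.parquet", engine="pyarrow")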
5. (Optional) Append Without Loading All Data (Efficient Method)
If your dataset is very large, instead of loading everything into memory, you can use the partitioned Parquet dataset approach.
Step 1: Save Each Batch Separately
new_data.to_parquet("parquet_dataset/new_file.parquet", engine="pyarrow")  # the parquet_dataset/ folder must already exist
Step 2: Load All Partitions Together
import pyarrow.dataset as ds
dataset = ds.dataset("parquet_dataset", format="parquet")
df = dataset.to_table().to_pandas()
This automatically merges all .parquet files inside the folder parquet_dataset/ without concatenating manually.
It’s faster and more memory-efficient for incremental additions.
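A useful side effect of the dataset API is that column selections and row filters are pushed down to the Parquet files, so only the matching data is read. The sketch below reuses the source_file column from earlier; any other column works the same way:
import pyarrow.dataset as ds

dataset = ds.dataset("parquet_dataset", format="parquet")

# Read only one column, and only the rows that came from a specific file
table = dataset.to_table(
    columns=["source_file"],
    filter=ds.field("source_file") == "new_file.csv",
)
df_small = table.to_pandas()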
6. Summary
| Step | Description | Command |
| --- | --- | --- |
| 1 | Read existing Parquet file | pd.read_parquet("all_combined.parquet") |
| 2 | Read new CSV or Excel file | pd.read_csv() or pd.read_excel() |
| 3 | Combine old and new data | pd.concat([df, new_data]) |
| 4 | Save combined data | to_parquet("all_combined.parquet") |
| 5 | For large datasets | Use the partitioned folder approach with pyarrow.dataset |
7. Example Workflow
import pandas as pd
# Step 1: Read the existing Parquet file
df = pd.read_parquet("all_combined.parquet", engine="pyarrow")
# Step 2: Read a new CSV file
new_data = pd.read_csv("sales_update.csv", engine="pyarrow")
new_data["source_file"] = "sales_update.csv"
# Step 3: Combine the two DataFrames
updated_df = pd.concat([df, new_data], ignore_index=True)
# Step 4: Save the updated data
updated_df.to_parquet("all_combined.parquet", engine="pyarrow")
8. Notes
- Always use the same column names and data types to ensure smooth merging.
- If you plan to update data frequently, use partitioned Parquet storage in a folder instead of one big file.
- Parquet supports schema evolution, but it’s better to maintain consistent structure across files.
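To catch column or dtype mismatches before they silently turn into NaN columns, a small pre-concatenation check along these lines can help (a sketch; adapt it to your own schema):
# Compare the columns of the existing and the new data before concatenating
missing_in_new = set(df.columns) - set(new_data.columns)
extra_in_new = set(new_data.columns) - set(df.columns)
if missing_in_new or extra_in_new:
    print("Column mismatch:", missing_in_new, extra_in_new)

# Align the dtypes of shared columns to the existing dataset
for col in df.columns.intersection(new_data.columns):
    new_data[col] = new_data[col].astype(df[col].dtype)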