
Download data from Kaggle

import kagglehub
import shutil
import os

def download_kaggle_dataset(dataset_slug, destination_dir):
    # Download dataset to the default location
    path = kagglehub.dataset_download(dataset_slug)

    # Ensure the destination directory exists
    os.makedirs(destination_dir, exist_ok=True)

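    # Move the downloaded folder into the destination directory.
    # Because destination_dir already exists, shutil.move places the folder
    # inside it and returns the resulting path.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Download path does not exist: {path}")
    final_path = shutil.move(path, destination_dir)
    print(f"Dataset '{dataset_slug}' moved to:", final_path)

    return final_path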

# Example usage
dataset_path = download_kaggle_dataset("timchant/supstore-dataset-2019-2022", "/Users/teslim/OneDrive/mlzoomcamp")
print("Dataset is available at:", dataset_path)
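
One detail of the move step is worth seeing in isolation: when the destination directory already exists, shutil.move puts the source folder inside it instead of merging the contents, and returns the resulting path. The sketch below reproduces that with temporary folders only, so it needs no Kaggle access. The helper name move_into and the folder names ("cache", "1", "project_data") are made up for illustration and are not part of kagglehub.

```python
import os
import shutil
import tempfile


def move_into(src_dir, destination_dir):
    """Move src_dir into destination_dir and return where it actually ended up."""
    os.makedirs(destination_dir, exist_ok=True)
    # With an existing destination directory, shutil.move returns
    # destination_dir/<name of src_dir>, not destination_dir itself.
    return shutil.move(src_dir, destination_dir)


# Stand-in for a downloaded dataset folder (hypothetical layout)
workspace = tempfile.mkdtemp()
fake_download = os.path.join(workspace, "cache", "1")
os.makedirs(fake_download)
with open(os.path.join(fake_download, "data.csv"), "w") as f:
    f.write("a,b\n1,2\n")

target = os.path.join(workspace, "project_data")
final_path = move_into(fake_download, target)

print(final_path)              # the folder "1" now sits inside project_data
print(os.listdir(final_path))  # the files moved along with it
```

If you want the files directly inside the destination rather than in a subfolder, you would need to move or copy the folder's contents instead of the folder itself.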

To download a different Kaggle dataset to a specific location with this function, you only need to change two arguments in the function call:

  1. dataset_slug: This is the unique identifier for the Kaggle dataset you want to download. It follows the format "owner/dataset-name", for example "timchant/supstore-dataset-2019-2022" in the code above.

  2. destination_dir: This is the directory path on your local machine where you want the dataset saved. Change this path to organize datasets by project or location.
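
Since a mistyped slug only fails once the download is attempted, a quick format check beforehand can help. This is a minimal sketch that assumes a slug is exactly two non-empty parts separated by one slash, with no whitespace; is_valid_slug is a hypothetical helper, and Kaggle's actual naming rules may be stricter.

```python
import re


def is_valid_slug(dataset_slug):
    """Return True if dataset_slug looks like "owner/dataset-name".

    Assumption: one "/" separating two non-empty parts without whitespace.
    This only checks the shape, not whether the dataset exists on Kaggle.
    """
    return re.fullmatch(r"[^/\s]+/[^/\s]+", dataset_slug) is not None


print(is_valid_slug("timchant/supstore-dataset-2019-2022"))
print(is_valid_slug("supstore-dataset-2019-2022"))
```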

Here’s a quick step-by-step guide for using the function with these changes:

Step-by-Step Guide

  1. Identify the Dataset Slug:
    • Go to the Kaggle dataset you want to download. The slug is the owner/dataset-name part of the URL:
      https://www.kaggle.com/datasets/dataset-owner/dataset-name
    • For example, for the dataset located at https://www.kaggle.com/datasets/timchant/supstore-dataset-2019-2022, the slug is:
      dataset_slug = "timchant/supstore-dataset-2019-2022"
  2. Choose Your Destination Directory:
    • Decide where you want the dataset to be saved. You might organize it within a project folder or a general data folder.
    • Specify the full path or a relative path based on your project setup:
      destination_dir = "/path/to/your/project/data"
    • Replace "/path/to/your/project/data" with the actual path you want.
  3. Run the Function with Updated Parameters:
    • Call the function with the new dataset slug and destination directory:
      dataset_path = download_kaggle_dataset("timchant/supstore-dataset-2019-2022", "/Users/teslim/OneDrive/mlzoomcamp")
      print("Dataset is available at:", dataset_path)
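
Step 1 can also be done in code if you prefer to paste the dataset URL rather than pick the slug out by hand. The sketch below uses a hypothetical helper, slug_from_url, which is not part of kagglehub. It assumes the owner and dataset name are the first two path segments of the URL, after an optional "datasets" segment.

```python
from urllib.parse import urlparse


def slug_from_url(url):
    """Return "owner/dataset-name" from a Kaggle dataset URL (hypothetical helper)."""
    # Split the URL path into its non-empty segments
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Some URLs carry a leading "datasets" segment; drop it if present
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Cannot find owner/dataset-name in: {url}")
    # Ignore anything after the dataset name (for example a trailing page segment)
    return "/".join(parts[:2])


dataset_slug = slug_from_url("https://www.kaggle.com/timchant/supstore-dataset-2019-2022")
print(dataset_slug)
```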
      
      

Example Usage

Suppose you want to download a different dataset, such as "zynicide/wine-reviews", and save it to a folder named "wine_analysis" inside your documents directory. Here’s how you’d adjust the function call:

# Define new dataset slug and destination directory
dataset_slug = "zynicide/wine-reviews"
destination_dir = "/Users/teslim/Documents/wine_analysis"

# Download the dataset
dataset_path = download_kaggle_dataset(dataset_slug, destination_dir)
print("Dataset is available at:", dataset_path)
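
A hard-coded path such as "/Users/teslim/Documents/wine_analysis" only works on one machine. As a sketch of an alternative, the destination can be built from the current user's home directory with pathlib. The helper build_destination and its base parameter are illustrative assumptions, not part of the original function.

```python
from pathlib import Path


def build_destination(*parts, base=None):
    """Join folder names under base (default: the current user's home directory)."""
    base = Path(base) if base is not None else Path.home()
    return str(base.joinpath(*parts))


# Resolves to <home>/Documents/wine_analysis on whichever machine runs it
destination_dir = build_destination("Documents", "wine_analysis")
print(destination_dir)
```

The resulting string can be passed as destination_dir to download_kaggle_dataset in place of the literal path.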

Summary of What You’ll Change Each Time

  1. dataset_slug – Update to the slug of the dataset you want to download.
  2. destination_dir – Update to the directory where you want the dataset saved.

With these two simple changes, you can easily use this function to download any Kaggle dataset to your preferred directory.