Exercise - Deploy the SDOH datasets - transformations capability

In this exercise, you deploy the social determinants of health (SDOH) datasets - transformations capability into your workspace that already has healthcare data solutions in Fabric deployed. You upload prebuilt datasets from the sample data folder to the Ingest folder in the bronze lakehouse, where the pipeline picks them up for transformation. Finally, you run the SDOH ingestion pipeline to transform the files that you uploaded.

Prerequisites

To complete this exercise, you must have a Fabric workspace with healthcare data solutions deployed. For more information, see Exercise - Deploy healthcare data solutions in Microsoft Fabric with sample data.

Deploy the SDOH datasets - transformations capability

To deploy the SDOH datasets - transformations capability, follow these steps:

  1. Go to Microsoft Power BI and sign in.

  2. Select the FL HC Cloud Fabric workspace where you deployed the healthcare data solutions in Fabric, as mentioned in Exercise - Deploy healthcare data solutions in Microsoft Fabric with sample data.

  3. Locate and select the FLHealthcare healthcare data solution that you deployed.

    Screenshot of the healthcare data solution that you deployed selected.

  4. Under Solution capabilities and sample data, select SDOH datasets - transformations.

    Screenshot of the SDOH datasets – transformations capability selected.

  5. Select Deploy to workspace.

  6. Wait for the deployment to complete, and then select the workspace that you created.

  7. Locate and open the FLHealthcare_msft_bronze lakehouse.

    Screenshot of the bronze lakehouse selected in your workspace.

  8. Select Open notebook > New notebook.

    Screenshot of the Open notebook and New notebook options selected.

  9. Select Workspace default and change the environment to FLHealthcare_environment.

    Screenshot of the FL Healthcare environment selected.

  10. Add the following code to the existing code cell. This code runs the configuration notebook to initialize the values for the copy script. Replace FL with your initials.

    %run FLHealthcare_msft_config_notebook

  11. Select Add code cell.

    Screenshot of the Add code cell option selected.

  12. Add the following script to copy the United States Department of Agriculture (USDA) Food Environment Atlas, the USDA Rural Atlas, and the Agency for Healthcare Research and Quality (AHRQ) source datasets to the Ingest folder. The script recursively copies files and folders from the sample data source directory in the bronze lakehouse storage location to the correct ingest path, creating destination directories if they don't exist and skipping files that are already present. While only these three datasets are required for this exercise, you can also copy all prebuilt datasets for ingestion; a sketch that copies all of them appears after this procedure.

    bronze_lakehouse_files = "[Files ABFS path]"

    def copy_source_files_and_folders(source_path, destination_path):
        # List the contents of the source directory
        source_contents = mssparkutils.fs.ls(source_path)

        # List the contents of the destination directory
        try:
            destination_contents = mssparkutils.fs.ls(destination_path)
            destination_files = {item.path.split('/')[-1]: item.path for item in destination_contents}
        except Exception as e:
            print(f"Destination path {destination_path} does not exist or is empty. Creating the path.")
            destination_files = {}
            mssparkutils.fs.mkdirs(destination_path)

        # Copy each item inside the source directory to the destination directory
        for item in source_contents:
            item_path = item.path
            item_name = item_path.split('/')[-1]
            destination_item_path = f"{destination_path}/{item_name}"

            if item.isDir:
                # Recursively copy the contents of the directory
                copy_source_files_and_folders(item_path, destination_item_path)
            else:
                if item_name in destination_files:
                    print(f"File already exists, skipping: {destination_item_path}")
                else:
                    print(f"Creating new file: {destination_item_path}")
                    mssparkutils.fs.cp(item_path, destination_item_path, recurse=True)
    
    # Define the source and destination paths with placeholder values
    data_manager_sample_data_path = f"{bronze_lakehouse_files}/SampleData/SDOH"
    bronze_lakehouse_ingest = f"{bronze_lakehouse_files}/Ingest"
    
    # Copy the files for the Food Environment Atlas
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/XLSX/USDA/FoodEnvironmentAtlas"
    destination_path_xlsx = f"{bronze_lakehouse_ingest}/SDOH/XLSX/USDA/FoodEnvironmentAtlas"
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)
    
    
    # Copy the files for the Rural Atlas
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/XLSX/USDA/RuralAtlas"
    destination_path_xlsx = f"{bronze_lakehouse_ingest}/SDOH/XLSX/USDA/RuralAtlas"
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)
    
    
    # Copy the files for AHRQ
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/XLSX/USAHRQ/AHRQ"
    destination_path_xlsx = f"{bronze_lakehouse_ingest}/SDOH/XLSX/USAHRQ/AHRQ"
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)
    
    
  13. On the left pane, select Lakehouses, and then select the bronze lakehouse.

  14. Expand Files, select the ellipsis (...) button on the Files node, and then select Copy ABFS path.

    Screenshot of the Copy ABFS path option.

  15. Replace [Files ABFS path] in the first line of the script with the path that you copied. For an example of the completed line, see the snippet after this procedure.

    Screenshot of the ingest script showing the Files ABFS path replaced.

  16. Select Run all and then wait for the run to complete.

  17. Expand the Ingest > SDOH > XLSX > USAHRQ and USDA nodes. The copied datasets should appear.

    Screenshot of the USDA node selected with the copied datasets.
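
For reference, after you complete step 15, the first line of the script should resemble the following example. The workspace and lakehouse segments are placeholders; the value that you copy typically contains identifiers that are specific to your workspace, so paste it exactly as copied.

    # Illustrative only - use the ABFS path that you copied in step 14.
    bronze_lakehouse_files = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Files"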
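
If you want to stage every prebuilt SDOH dataset rather than only the three that this exercise uses, you can reuse the copy_source_files_and_folders helper from step 12 on the root of the sample data folder. This sketch assumes that all prebuilt datasets sit under SampleData/SDOH and mirror the folder structure that the pipeline expects under Ingest/SDOH.

    # Optional: copy all prebuilt SDOH sample datasets to the Ingest folder.
    # Assumes bronze_lakehouse_files and the helper from step 12 are already defined.
    copy_source_files_and_folders(
        f"{bronze_lakehouse_files}/SampleData/SDOH",
        f"{bronze_lakehouse_files}/Ingest/SDOH"
    )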

Run the SDOH ingestion pipeline

To run the SDOH ingestion pipeline, follow these steps:

  1. Select the FL HC Cloud Fabric workspace that you created.

  2. Locate and select the FLHealthcare_msft_sdoh_ingestion data pipeline. Replace FL with your initials.

    Screenshot of the SDOH data pipeline selected.

  3. Review the pipeline. This pipeline copies the files from the Ingest folder and prepares them for ingestion into the bronze lakehouse. Then, it flattens the data from the bronze lakehouse into the silver lakehouse.

  4. Select Run and then wait for the pipeline run to complete.

    Screenshot of the notebooks in the pipeline and the Run button selected.

The pipeline run should complete successfully. If the activity status doesn't change after a while, refresh the view.
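
To confirm that the transformed SDOH data landed in the silver lakehouse, you can run a quick check from a notebook in the same workspace. This sketch assumes that the silver lakehouse follows the same naming pattern as the bronze lakehouse (FLHealthcare_msft_silver, with FL replaced by your initials); adjust the name if yours differs.

    # List the tables in the silver lakehouse to verify that the SDOH ingestion
    # pipeline produced output. The lakehouse name is assumed from the bronze
    # lakehouse naming pattern.
    spark.sql("SHOW TABLES IN FLHealthcare_msft_silver").show(truncate=False)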