Exercise - Deploy the SDOH datasets - transformations capability

In this exercise, you deploy the social determinants of health (SDOH) datasets - transformations capability into your workspace that already has healthcare data solutions in Fabric deployed. You upload prebuilt datasets from the sample data folder to the Ingest folder in the bronze lakehouse, where the pipeline picks them up for transformation. Finally, you run the SDOH ingestion pipeline to transform the files that you uploaded.

Prerequisites

To complete this exercise, you must have a Fabric workspace with healthcare data solutions deployed. For more information, see Exercise - Deploy healthcare data solutions in Microsoft Fabric with sample data.

Deploy the SDOH datasets - transformations capability

To deploy the SDOH datasets - transformations capability, follow these steps:

  1. Go to Microsoft Power BI and sign in.

  2. Select the FL HC Cloud Fabric workspace where you deployed the healthcare data solutions in Fabric, as mentioned in Exercise - Deploy healthcare data solutions in Microsoft Fabric with sample data.

  3. Locate and select the FLHealthcare healthcare data solution that you deployed.

    Screenshot of the healthcare data solution that you deployed selected.

  4. Under Solution capabilities and sample data, select SDOH datasets - transformations.

    Screenshot of the SDOH datasets – transformations capability selected.

  5. Select Deploy to workspace.

  6. Wait for the deployment to complete, and then select the workspace that you created.

  7. Locate and open the FLHealthcare_msft_bronze lakehouse.

    Screenshot of the bronze lakehouse selected in your workspace.

  8. Select Open notebook > New notebook.

    Screenshot of the Open notebook and New notebook options selected.

  9. Select Workspace default and change the environment to FLHealthcare_environment.

    Screenshot of the FL Healthcare environment selected.

  10. Add the following code to the existing code cell. This code runs the configuration notebook to initialize the values for the copy script. Replace FL with your initials.

    %run FLHealthcare_msft_config_notebook

  11. Select Add code cell.

    Screenshot of the Add code cell option selected.

  12. Add the following script to copy the United States Department of Agriculture (USDA) Food Environment Atlas, the USDA Rural Atlas, and the Agency for Healthcare Research and Quality (AHRQ) source datasets to the Ingest folder. The script recursively copies files and folders from the sample data source directory in the bronze lakehouse storage location to the correct ingest path, creating destination directories if they don't exist and skipping files that are already present. While only these three datasets are required for this exercise, you can also copy all prebuilt datasets for ingestion; a sketch that copies all of them appears after this procedure.

    bronze_lakehouse_files = "[Files ABFS path]"

    def copy_source_files_and_folders(source_path, destination_path):
        # List the contents of the source directory
        source_contents = mssparkutils.fs.ls(source_path)

        # List the contents of the destination directory
        try:
            destination_contents = mssparkutils.fs.ls(destination_path)
            destination_files = {item.path.split('/')[-1]: item.path for item in destination_contents}
        except Exception as e:
            print(f"Destination path {destination_path} does not exist or is empty. Creating the path.")
            destination_files = {}
            mssparkutils.fs.mkdirs(destination_path)

        # Copy each item inside the source directory to the destination directory
        for item in source_contents:
            item_path = item.path
            item_name = item_path.split('/')[-1]
            destination_item_path = f"{destination_path}/{item_name}"

            if item.isDir:
                # Recursively copy the contents of the directory
                copy_source_files_and_folders(item_path, destination_item_path)
            else:
                if item_name in destination_files:
                    print(f"File already exists, skipping: {destination_item_path}")
                else:
                    print(f"Creating new file: {destination_item_path}")
                    mssparkutils.fs.cp(item_path, destination_item_path, recurse=True)
    
    # Define the source and destination paths with placeholder values
    data_manager_sample_data_path = f"{bronze_lakehouse_files}/SampleData/SDOH"
    bronze_lakehouse_ingest = f"{bronze_lakehouse_files}/Ingest"
    
    # Copy the files for the Food Environment Atlas
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/XLSX/USDA/FoodEnvironmentAtlas"
    destination_path_xlsx = f"{bronze_lakehouse_ingest}/SDOH/XLSX/USDA/FoodEnvironmentAtlas"
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)
    
    
    # Copy the files for the Rural Atlas
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/XLSX/USDA/RuralAtlas"
    destination_path_xlsx = f"{bronze_lakehouse_ingest}/SDOH/XLSX/USDA/RuralAtlas"
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)
    
    
    # Copy the files for AHRQ
    sdoh_xlsx_data_path = f"{data_manager_sample_data_path}/XLSX/USAHRQ/AHRQ"
    destination_path_xlsx = f"{bronze_lakehouse_ingest}/SDOH/XLSX/USAHRQ/AHRQ"
    copy_source_files_and_folders(sdoh_xlsx_data_path, destination_path_xlsx)
    
    
  13. On the left pane, select Lakehouses, and then select the bronze lakehouse.

  14. Expand Files, select the ellipsis (...) button on the Files node, and then select Copy ABFS path.

    Screenshot of the Copy ABFS path option.

  15. Replace [Files ABFS path] in the first line of the script with the path that you copied. For an example of the completed line, see the snippet after this procedure.

    Screenshot of the ingest script showing the Files ABFS path replaced.

  16. Select Run all and then wait for the run to complete.

  17. Expand the Ingest > SDOH > XLSX > USAHRQ and USDA nodes. The copied datasets should appear.

    Screenshot of the USDA node selected with the copied datasets.
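
For reference, after you complete step 15, the first line of the script should resemble the following example. The workspace and lakehouse segments are placeholders; the value that you copy typically contains identifiers that are specific to your workspace, so paste it exactly as copied.

    # Illustrative only - use the ABFS path that you copied in step 14.
    bronze_lakehouse_files = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Files"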
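
If you want to stage every prebuilt SDOH dataset rather than only the three that this exercise uses, you can reuse the copy_source_files_and_folders helper from step 12 on the root of the sample data folder. This sketch assumes that all prebuilt datasets sit under SampleData/SDOH and mirror the folder structure that the pipeline expects under Ingest/SDOH.

    # Optional: copy all prebuilt SDOH sample datasets to the Ingest folder.
    # Assumes bronze_lakehouse_files and the helper from step 12 are already defined.
    copy_source_files_and_folders(
        f"{bronze_lakehouse_files}/SampleData/SDOH",
        f"{bronze_lakehouse_files}/Ingest/SDOH"
    )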

Run the SDOH ingestion pipeline

To run the SDOH ingestion pipeline, follow these steps:

  1. Select the FL HC Cloud Fabric workspace that you created.

  2. Locate and select the FLHealthcare_msft_sdoh_ingestion data pipeline. Replace FL with your initials.

    Screenshot of the SDOH data pipeline selected.

  3. Review the pipeline. This pipeline copies the files from the Ingest folder and prepares them for ingestion into the bronze lakehouse. Then, it flattens the data from the bronze lakehouse into the silver lakehouse.

  4. Select Run and then wait for the pipeline run to complete.

    Screenshot of the notebooks in the pipeline and the Run button selected.

The pipeline run should complete successfully. If the activity status doesn't change after a while, refresh the view.
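
To confirm that the transformed SDOH data landed in the silver lakehouse, you can run a quick check from a notebook in the same workspace. This sketch assumes that the silver lakehouse follows the same naming pattern as the bronze lakehouse (FLHealthcare_msft_silver, with FL replaced by your initials); adjust the name if yours differs.

    # List the tables in the silver lakehouse to verify that the SDOH ingestion
    # pipeline produced output. The lakehouse name is assumed from the bronze
    # lakehouse naming pattern.
    spark.sql("SHOW TABLES IN FLHealthcare_msft_silver").show(truncate=False)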