Overview of SharePoint ingestion setup

Learn about the supported authentication methods for SharePoint ingestion into Azure Databricks.

Important

The managed SharePoint connector is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

Tip

This page covers the managed SharePoint connector for ingesting unstructured files (PDFs, DOCX, and more) for use in applications such as RAG.

To build custom pipelines with the SharePoint connector, providing full control over parsing, transformations, and ingestion of both structured files (for example, CSV and Excel) and unstructured files into Delta tables, see Ingest files from SharePoint.

Choose your SharePoint connector

Lakeflow Connect offers two complementary SharePoint connectors. They both access data in SharePoint, but they support distinct goals.

| Consideration | Managed SharePoint connector | Standard SharePoint connector |
| --- | --- | --- |
| Management and customization | A fully managed connector. Simple, low-maintenance connectors for enterprise applications that ingest data into Delta tables and keep them in sync with the source. See Managed connectors in Lakeflow Connect. | Build custom ingestion pipelines with SQL, PySpark, or Lakeflow Spark Declarative Pipelines using batch and streaming APIs such as read_files, spark.read, COPY INTO, and Auto Loader. Offers the flexibility to perform complex transformations during ingestion, while giving you greater responsibility for managing and maintaining your pipelines. |
| Output format | Uniform binary content table. Ingests each file in binary format (one file per row), along with file metadata in additional columns. | Structured Delta tables. Ingests structured files (like CSV and Excel) as Delta tables. Can also be used to ingest unstructured files in binary format. |
| Granularity, filtering, and selection | No subfolder or file-level selection today. No pattern-based filtering. Ingests all files in the specified SharePoint document library. | Granular and custom. URL-based selection to ingest from document libraries, subfolders, or individual files. Also supports pattern-based filtering using the pathGlobFilter option. |
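
To make the pattern-based filtering mentioned for the standard connector more concrete, the following is a minimal Python sketch of what a glob filter on file names does. It is not connector code: the helper `filter_by_glob` and the `contoso` URLs are hypothetical, and Python's `fnmatch` only approximates the glob syntax that the pathGlobFilter option accepts. The sketch assumes the pattern is matched against the file name rather than the full path.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlparse


def filter_by_glob(file_urls, pattern):
    """Keep only the URLs whose file name matches the glob pattern.

    Illustrative only: approximates name-based glob filtering with fnmatch.
    """
    kept = []
    for url in file_urls:
        # Match on the last path segment (the file name), not the whole URL.
        name = basename(urlparse(url).path)
        if fnmatchcase(name, pattern):
            kept.append(url)
    return kept


# Hypothetical document library contents.
files = [
    "https://contoso.sharepoint.com/sites/demo/Docs/report.pdf",
    "https://contoso.sharepoint.com/sites/demo/Docs/notes.docx",
    "https://contoso.sharepoint.com/sites/demo/Docs/2024/summary.pdf",
]

pdf_files = filter_by_glob(files, "*.pdf")
```

With the managed connector, no such filter applies: every file in the specified document library is ingested.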

Which authentication methods are supported?

The SharePoint connector supports the following authentication methods: machine-to-machine (M2M) OAuth, user-to-machine (U2M) OAuth, and manual token refresh.

Which authentication method should I choose?

In most scenarios, Databricks recommends machine-to-machine (M2M) OAuth, which scopes connector permissions to a specific site. If you instead want the connector to access whatever the authenticating user can access, choose user-to-machine (U2M) OAuth. Both methods refresh tokens automatically and offer heightened security.

Manual token refresh authentication is considered a legacy method and is not recommended.
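
The recommendation above can be summarized as a small decision helper. This is only a sketch of the guidance on this page: the `AuthMethod` enum and `recommend_auth_method` function are hypothetical names, not part of any Databricks API, and the legacy manual token refresh method is deliberately never returned.

```python
from enum import Enum


class AuthMethod(Enum):
    """Recommended OAuth methods (hypothetical names for illustration)."""
    OAUTH_M2M = "OAuth M2M"
    OAUTH_U2M = "OAuth U2M"


def recommend_auth_method(scope_to_signed_in_user: bool) -> AuthMethod:
    """Return the method this page recommends.

    M2M is the default recommendation; U2M applies only when access
    should be limited to what the authenticating user can access.
    """
    if scope_to_signed_in_user:
        return AuthMethod.OAUTH_U2M
    return AuthMethod.OAUTH_M2M


# An automated production pipeline with site-scoped permissions.
pipeline_choice = recommend_auth_method(scope_to_signed_in_user=False)

# A scenario that should follow one user's own SharePoint permissions.
user_choice = recommend_auth_method(scope_to_signed_in_user=True)
```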

U2M compared to M2M

The following table compares U2M and M2M for authentication to SharePoint:

| Feature | OAuth U2M | OAuth M2M |
| --- | --- | --- |
| Authentication type | Delegated access (user-based) | App-only access (service principal) |
| User interaction required | Yes - User must sign in | No - Fully automated |
| Best for | User-specific access scenarios | Automated production pipelines |
| Token refresh | Handled automatically by Azure Databricks | Handled automatically by Azure Databricks |
| SharePoint permissions | Delegated permissions | Application permissions |
| Access scope | Limited to user's permissions | Defined by app registration |