Deploying the Purview Information Protection (PIP) Scanner

 
 

Introduction


Information Protection and Data Loss Prevention capabilities can extend beyond traditional Microsoft 365 workloads, enabling organizations to discover and safeguard sensitive data at rest across on-premises file shares and SharePoint environments.
This guide will walk you through deploying the Purview Information Protection (PIP) Scanner to discover and label sensitive data stored on on-premises file shares and repositories. It summarizes the key prerequisites for servers, service accounts, SQL, permissions, and connectivity.
All prerequisites are outlined in Microsoft’s official documentation: “Get started with the information protection scanner”.

Prerequisites

  • Azure & licensing
    • An Azure tenant (Microsoft Entra ID) and appropriate Microsoft Purview / Information Protection licensing.
      • Microsoft 365 E3/E5
      • Microsoft Business Premium
      • E5 Compliance add-on / Purview Suite add-on
  • Information Protection Requirements
    • You must have at least one sensitivity label configured in the Microsoft Purview portal for the scanner account, to apply classification and, optionally, encryption.
    • It’s recommended to publish the sensitivity label you intend to use to your Service Account.
  • Permissions
    • You must have a service account to run the scanner service on the Windows Server computer, as well as authenticate to Microsoft Entra ID and download the scanner's policy.
    • Your service account must be an Active Directory account and synchronized to Microsoft Entra ID.
    • “Full Control” & “Site Collector Auditor Rights” permissions in SharePoint On-Premise
    • “Allow log on locally” user right assignment on the server (Local Security Policy > Local Policies > User Rights Assignment > Allow log on locally)
    • “Log on as a service” user right assignment on the server (Local Security Policy > Local Policies > User Rights Assignment > Log on as a service)
  • Scanner server
    • A dedicated Windows Server machine (or VM) to host the scanner service, with network connectivity to:
      • On-prem file shares and repositories to be scanned
      • The SQL Server hosting the scanner database
    • A domain account to run the scanner service (used for access to on-prem content sources and required permissions).
  • SQL Server
    • A supported SQL Server instance for the scanner configuration database, plus permissions to create and manage the scanner database.
    • SQL Server 2016 and above
      • Includes SQL Server Standard & Enterprise
      • For small servers or testing only, SQL Server Express can be used
  • Network/proxy requirements (if applicable)
    • Ensure outbound connectivity and/or proxy configurations can reach Microsoft endpoints.
  • Certificates (if applicable)
    • Any required TLS/SSL certificates for secure connectivity, depending on your environment and proxy/inspection configuration.
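
The connectivity prerequisites above can be spot-checked from the scanner server before anything is installed. The following is a minimal sketch, not part of Microsoft's tooling: `Test-TcpPort` is a hypothetical helper, and the host names in the usage comments are placeholders. Port 1433 is the default for a default SQL Server instance and 445 is SMB; named instances such as SQL Server Express often listen on a dynamic port, so adjust the port to your environment.

```powershell
function Test-TcpPort {
    param(
        [Parameter(Mandatory)][string]$HostName,
        [Parameter(Mandatory)][int]$Port,
        [int]$TimeoutMs = 3000
    )
    $client = [System.Net.Sockets.TcpClient]::new()
    try {
        $task = $client.ConnectAsync($HostName, $Port)
        # Wait returns $false on timeout; a refused connection throws instead.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    }
    catch {
        return $false
    }
    finally {
        $client.Dispose()
    }
}

# Example usage (placeholder host names, replace with your own):
# Test-TcpPort -HostName 'sql01.contoso.local' -Port 1433   # scanner database
# Test-TcpPort -HostName 'fileserver01.contoso.local' -Port 445   # file share
```

A successful TCP connection only shows that the port is reachable; it does not confirm that the service account has the required permissions on the target.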

Configure the Scanner Settings in Purview

To configure your scanner in the Microsoft Purview portal:
  1. Sign in using one of the following roles:
      • Compliance Administrator
      • Compliance Data Administrator
      • Security Administrator
      • Organization Management
  1. Open the Microsoft Purview portal and go to Settings > Information Protection > Information protection scanner.
  1. Create a scanner cluster. This cluster defines your scanner and is used to identify the scanner instance, such as during installation, upgrades, and other processes. You will use this name later to identify where you want to install or upgrade your scanner.
💡
IMPORTANT
The scanner supports clusters with multiple nodes, enabling your organization to scale out, achieving faster scan times and broader scope. However, only a single node can be present on a server at one time. More information on Nodes can be found below.
  1. Create a content scan job to define the repositories you want to scan.
The content scan job is a crucial piece when deploying labels on-premises. It allows you to specify which repositories to scan, frequency, info types to discover, and more.

Structural Breakdown of Each Setting

The first set of options we will break down are Clusters, Schedule, and Info types to be discovered.
A cluster is a name used to identify a scanner’s configurations and repositories. Clusters may contain multiple Nodes or servers within them to scale out your scans and should be thought of as groups.
Schedule allows you to specify how often the scanner runs on your specified repositories.
  • Manual: A single scan started manually, for example by running Start-Scan locally on the server via PowerShell or by using the Scan now option in the portal.
  • Always: The specified repositories are repeatedly scanned in sequence, and the information protection scanner service is not stopped. This option is useful for file shares that are in frequent use.
Info types to be discovered is generally the most misunderstood setting of the PIP scanner. Microsoft’s documentation does not clearly outline what exactly “policy only” looks for or enforces, so you may get conflicting information when researching various sources.
  • Policy Only: The scanner uses the conditions (predefined and custom information types) that you have specified for labels. In other words, the scanner should only search for the SITs you have defined in auto-labeling conditions for the labels published to the scanner account.
    • This option is highly recommended for organizations that want to discover specific types of information such as financial data, PII, and more.
      • The auto-labeling policy for on-premises locations can only be created within a Sensitivity Label and cannot be a standalone policy.
  • All: The scanner checks the specified repositories against all custom and built-in Microsoft SITs to gather as much data as possible.
    • This option is fine to use while initially testing and discovering data on your servers. However, this is a very broad scan, and you will see many false positives as a result.

Policy Only Example

I have a sensitivity label called “Classified On-Prem” scoped to “Files & other data assets”.
You also have the option to configure encryption settings and content marking, allowing you to control access (encryption) to labeled items and apply visual indicators such as custom watermarks. For this scenario, these options will not be utilized as I will be labeling content for visibility purposes only.
The most important step when implementing the PIP scanner is to create an auto-labeling policy that detects specific sensitive information within your file shares or SharePoint. In this scenario, I have two separate conditions, each searching for one or more instances of:
  • Financial Information:
    • U.S. Bank Account Numbers, ABA Routing Numbers, or Credit Card Numbers
OR
  • Personally Identifiable Information:
    • U.S. / U.K. Passport Numbers, U.S. Driver’s License Numbers, U.S. Individual Taxpayer Identification Number (ITIN), U.S. Social Security Number (SSN)
This policy will automatically apply the “Classified On-Prem” label when either condition is met.
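
The OR logic between the two condition groups can be sketched in a few lines of PowerShell. This only models the rule as described above and is not how the scanner evaluates content internally. The function name `Get-MatchedConditionGroup` is hypothetical, and the SIT names are taken from this article's wording rather than the exact names shown in the Purview portal.

```powershell
# Condition groups from the example policy (names follow the article's wording).
$ConditionGroups = [ordered]@{
    'Financial Information' = @(
        'U.S. Bank Account Number', 'ABA Routing Number', 'Credit Card Number'
    )
    'Personally Identifiable Information' = @(
        'U.S. / U.K. Passport Number', "U.S. Driver's License Number",
        'U.S. Individual Taxpayer Identification Number (ITIN)',
        'U.S. Social Security Number (SSN)'
    )
}

function Get-MatchedConditionGroup {
    param(
        [hashtable]$DetectedCounts,                  # SIT name -> number of instances found
        [System.Collections.IDictionary]$Groups,     # group name -> list of SIT names
        [int]$MinCount = 1
    )
    foreach ($name in $Groups.Keys) {
        foreach ($sit in $Groups[$name]) {
            if ($DetectedCounts.ContainsKey($sit) -and $DetectedCounts[$sit] -ge $MinCount) {
                $name   # emit the group name once, then move on to the next group
                break
            }
        }
    }
}

# A file with two credit card numbers satisfies the first group, so the label applies.
$matched = @(Get-MatchedConditionGroup -DetectedCounts @{ 'Credit Card Number' = 2 } -Groups $ConditionGroups)
$applyLabel = $matched.Count -gt 0
```

Because the groups are joined with OR, a single match in either group is enough for `$applyLabel` to be true.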
Once this label has been created and published to your service account, selecting “Policy only” will only detect the SITs outlined in your auto-labeling policy.
Next, we will cover the remaining General options when creating a scan job:
Treat recommended labeling as automatic:
When you create a sensitivity label, you have an option to configure auto-labeling to either automatically apply a label when certain conditions are met or recommend users to apply it.
Since the scanner runs in non-interactive mode, you have an option to configure the scanner to automatically apply a label, even though it was configured to “recommend” in the label properties.
This can be useful when wanting to utilize a policy that is not configured for automatic application.
Enable DLP policy rules:
Defines whether an on-premises repository DLP policy is applicable in this content scan job.
  • Off: Disables DLP policy evaluation on this content scan job
  • On: Enables DLP policy evaluation on this content scan job
Enforce sensitivity labeling policy:
  • If set to “Off”, the scanner won’t apply any label or protection and will run in discovery mode. This is a good option when you want to understand what sensitive information you have.
  • “On” tells the scanner to apply, based on other options, either a default label or the labels published to the scanner account.
If you have multiple labels published to your scanner account, auto-labeling policies will be enforced and applied if “Label files based on content” is enabled.
Label files based on content:
Allows you to either apply a default label without content inspection or inspect files for the SITs you have specified for your labels (auto-labeling policy).
Default label:
Specifies whether the scanner sets a default label on unlabeled files for this data repository. You can apply the default label from the information protection policy, or another label:
  • None: For unlabeled files, do not apply a default label.
  • Policy default: For unlabeled files, apply the default label that is specified in the information protection policy.
  • Custom: For unlabeled files, apply the specified label.
In this scenario, I selected a custom default label and set it to “Classified On-Prem”. The label is then applied automatically when a file meets our auto-labeling conditions.
Relabel files:
Specify whether to apply a different label to a file that's already labeled.
By default, the scanner doesn't relabel the files, unless the new label has higher sensitivity than the current label, and the initial label was not manually applied by an end user. When you select On, the scanner always replaces an existing label when the configured conditions apply.
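
The relabeling rule above can be expressed as a small decision function. This is a sketch of the behavior as this section describes it, not the scanner's actual implementation. `Test-ShouldRelabel` is a hypothetical name, and the integer sensitivity values are an assumption where a higher number means a more sensitive label.

```powershell
function Test-ShouldRelabel {
    param(
        # $null means the file has no label yet.
        [Nullable[int]]$CurrentSensitivity,
        [Parameter(Mandatory)][int]$NewSensitivity,
        [bool]$ManuallyApplied = $false,
        # Mirrors the "Relabel files" toggle in the content scan job.
        [bool]$RelabelFiles = $false
    )
    if ($null -eq $CurrentSensitivity) { return $true }   # unlabeled file
    if ($RelabelFiles) { return $true }                   # toggle On: always replace
    # Toggle Off: only upgrade, and never override a label a user applied by hand.
    return (($NewSensitivity -gt $CurrentSensitivity) -and (-not $ManuallyApplied))
}

# Example: an automatically applied label of sensitivity 1 is upgraded to 3.
Test-ShouldRelabel -CurrentSensitivity 1 -NewSensitivity 3
```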
Preserve "Date modified", "Last modified", and "Modified by":
Specify whether to leave the date unchanged for documents that the scanner labels:
  • Off: For local or network files, the Last Modified date is changed. For SharePoint files, the Modified date and Modified By are changed.
  • On: For local or network files, the Last Modified date remains unchanged. For SharePoint files, the Modified date and Modified By remain unchanged.
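
To confirm the setting behaves as expected on a file share, you can snapshot timestamps before a scan and compare them afterward. The sketch below is an assumption-laden illustration: both function names are hypothetical, it only covers local or network files, and it does not inspect the SharePoint Modified By value.

```powershell
function Get-LastWriteSnapshot {
    param([Parameter(Mandatory)][string]$Path)
    $snapshot = @{}
    # Record the last-write time (UTC) of every file under the folder.
    Get-ChildItem -LiteralPath $Path -File -Recurse | ForEach-Object {
        $snapshot[$_.FullName] = $_.LastWriteTimeUtc
    }
    return $snapshot
}

function Get-ChangedFile {
    param([hashtable]$Before, [hashtable]$After)
    # Emit every file present in both snapshots whose timestamp differs.
    foreach ($file in $After.Keys) {
        if ($Before.ContainsKey($file) -and $Before[$file] -ne $After[$file]) {
            $file
        }
    }
}

# Example usage (path taken from this article's test repository):
# $before = Get-LastWriteSnapshot -Path 'C:\AIP Scanner Testing 1'
# ... run the scan with labeling enforced ...
# $after = Get-LastWriteSnapshot -Path 'C:\AIP Scanner Testing 1'
# Get-ChangedFile -Before $before -After $after
```

With the setting On, the final command should return nothing for files the scanner labeled.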
Include or exclude file type to scan:
Specifies the file types to be included or excluded from scanning. It’s highly recommended to include all file types that are supported by the Azure Rights Management service.
  • To scan all files except specific file types, select Exclude and type the list of file name extensions to exclude from scanning. For example: .exe,.com,.bat
  • To scan specific file types, select Include and type the list of file name extensions to be scanned. For example: .doc,.docx,.xls,.xlsx
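
The include and exclude modes are two readings of the same comma-separated list. A minimal sketch of that logic follows, useful for previewing which files a list would cover. `Test-FileInScanScope` is a hypothetical helper and only mimics the behavior described here; the scanner applies its own matching.

```powershell
function Test-FileInScanScope {
    param(
        [Parameter(Mandatory)][string]$FileName,
        [Parameter(Mandatory)][ValidateSet('Include', 'Exclude')][string]$Mode,
        [Parameter(Mandatory)][string]$ExtensionList   # for example '.exe,.com,.bat'
    )
    # Split the list, trim whitespace, and drop empty entries.
    $extensions = @($ExtensionList.Split(',') | ForEach-Object { $_.Trim() } | Where-Object { $_ })
    $fileExtension = [System.IO.Path]::GetExtension($FileName)
    # -contains is case-insensitive, so '.DOCX' matches '.docx'.
    $listed = $extensions -contains $fileExtension
    if ($Mode -eq 'Include') { return $listed }
    return (-not $listed)
}

# Example: with the exclude list from above, a Word document is still scanned.
Test-FileInScanScope -FileName 'budget.docx' -Mode Exclude -ExtensionList '.exe,.com,.bat'
```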
Default owner:
Specifies the email address for the Owner custom property when a file is classified, and for the Rights Management owner if the file is not already protected. This setting can typically be left at its default, which is the scanner account.
  • For files on SharePoint Server, the SharePoint Editor (Last Modified By) value is used to set the owner of the file.
  • For files on SharePoint Server that do not have the Editor (Last Modified By) property set or if this property is set to a deleted user account, and for files that are stored on file shares or local folders, the setting specified in this field is used.
Set repository owner:
Specifies the UPN of a user or group that owns the repository.
The owners are granted full control permissions on a file if the permissions on that file are changed by a matched DLP rule.

Repositories

Now that we have broken down all of the general settings within a content scan job, it’s time to specify which repositories we want to scan.
Next to the General tab, select Repositories. Here, you can begin to add the file paths you’d like to scan within your server. You can add these repositories manually by selecting the “+ Add” button or by importing a .csv file.
On the Repository pane, specify the path for the data repository, and then select Save.
  • For a network share (UNC path), use \\Server\Folder.
  • For a SharePoint library, use http://sharepoint.contoso.com/Shared%20Documents/Folder.
  • For a local path, use C:\Folder.
🚨
Wildcards and WebDAV locations are not supported. Scanning OneDrive locations as repositories is also not supported.
Network shares are supported so long as the server has access to the specified path.
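
When importing many repositories at once, it can help to pre-validate the paths against the rules above. The following sketch is a heuristic, not an official check: `Test-RepositoryPath` is a hypothetical helper, and the WebDAV detection simply looks for the `DavWWWRoot` and `@SSL` markers that commonly appear in WebDAV-style UNC paths.

```powershell
function Test-RepositoryPath {
    param([Parameter(Mandatory)][string]$Path)

    $problems = @()
    if ($Path -match '[*?]') {
        $problems += 'Wildcards are not supported.'
    }
    if ($Path -match 'DavWWWRoot|@SSL') {
        $problems += 'Looks like a WebDAV location, which is not supported.'
    }

    # Classify the path by shape: UNC share, local drive path, or SharePoint URL.
    $kind = switch -Regex ($Path) {
        '^\\\\[^\\]+\\[^\\]+' { 'NetworkShare'; break }
        '^[A-Za-z]:\\'        { 'LocalPath'; break }
        '^https?://'          { 'SharePoint'; break }
        default               { 'Unknown' }
    }
    if ($kind -eq 'Unknown') {
        $problems += 'Path is not a UNC path, local path, or SharePoint URL.'
    }

    [pscustomobject]@{
        Path     = $Path
        Kind     = $kind
        IsValid  = ($problems.Count -eq 0)
        Problems = $problems
    }
}

# Example: the test folder used in this article is a valid local path.
Test-RepositoryPath -Path 'C:\AIP Scanner Testing 1'
```

The check is purely syntactic; it does not verify that the path exists or that the scanner account can reach it.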
In my case, I created a custom folder within my server’s C: drive and entered it as my path: C:\AIP Scanner Testing 1
Once a path is specified, we are presented with additional options to override our default content scan job settings. This is helpful if you want to scan multiple repositories but have different policies apply to them as needed.

Installing the Purview Information Protection Client

The Microsoft Purview Information Protection client must be installed on your server before you install the scanner.
Once the client has been installed, we can continue by configuring SQL, running the PowerShell commands, and installing the scanner.

SQL Server Express Installation

If the file server does not already have a SQL instance available, you can use SQL Server Express to quickly deploy the scanner and begin scanning.
This approach is suitable for smaller or test environments, though larger or production deployments should use SQL Server Standard or Enterprise for scalability and performance.
For complex environments with multiple file shares, it is recommended to use SQL Server Standard or Enterprise for the scanner database. Additionally, deploy a dedicated scanner service on each server hosting file shares to improve performance, reduce scan times, and maintain a more scalable and organized architecture.
Download SQL Server 2025 Express from Microsoft’s SQL Server downloads page.
Select Basic as the installation type:
Once complete, install SQL Server Management Studio (SSMS).
Open SSMS and connect to the new database by clicking Connect > Browse > Local > Name of the SQL Server Instance > Trust Server Certificate > Connect
Once complete, right-click your Server Instance Name > Properties > Copy the name of your instance

Scanner Installation

Now that we’ve created a new SQL Server Express instance, we can use the name of it to install the scanner locally via PowerShell.
Run the Install-Scanner cmdlet, specifying your SQL Server instance on which to create a database for the information protection scanner, and the scanner cluster name that you specified in the preceding section:
PowerShell Commands
For a SQL Server Express instance:
Install-Scanner -SqlServerInstance WIN22-DC1\SQLEXPRESS -Cluster WIN22-DC01
For a default instance:
Install-Scanner -SqlServerInstance <name> -Cluster <cluster name>
For a named instance:
Install-Scanner -SqlServerInstance SQLSERVER1\SCANNER -Cluster Europe
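
The only difference between the three variants above is how the `-SqlServerInstance` value is built. A minimal sketch that assembles the parameters for splatting follows. `New-ScannerInstallSplat` is a hypothetical helper; only the `-SqlServerInstance` and `-Cluster` parameters shown above are assumed to exist on `Install-Scanner`.

```powershell
function New-ScannerInstallSplat {
    param(
        [Parameter(Mandatory)][string]$SqlServer,
        [string]$InstanceName,                       # leave empty for a default instance
        [Parameter(Mandatory)][string]$Cluster
    )
    # A default instance is addressed by server name alone; a named instance
    # (including SQLEXPRESS) uses the Server\Instance form.
    $sqlInstance = if ([string]::IsNullOrWhiteSpace($InstanceName)) {
        $SqlServer
    }
    else {
        "$SqlServer\$InstanceName"
    }
    @{ SqlServerInstance = $sqlInstance; Cluster = $Cluster }
}

$installParams = New-ScannerInstallSplat -SqlServer 'SQLSERVER1' -InstanceName 'SCANNER' -Cluster 'Europe'
# On the scanner server, pass the hashtable to the cmdlet:
# Install-Scanner @installParams
```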
Now that you've installed the scanner, you need to get a Microsoft Entra token for the scanner service account to authenticate via an App Registration.

Enterprise Application Registration

A Microsoft Entra token is required as it allows the scanner to authenticate to the Microsoft Purview Information Protection Scanner service, enabling the scanner to run unattended. To get a token, we need to create a new App Registration within Entra.
  1. In the Microsoft Entra ID side pane, click App Registrations.
  1. At the top, click + New registration.
  1. In the Name field, enter Information Protection Scanner.
  1. Leave Supported account types as default.
  1. For the Redirect URI, leave the type as Web, enter http://localhost, and click Register.
On the Overview page of this application, note the following IDs in your text editor of choice: Application (client) ID and Directory (tenant) ID. You will need these later when running the Set-Authentication command.
  1. On the side pane, navigate to Certificates and Secrets.
  1. Click on + New client secret.
  1. In the dialog box that shows up, enter a description for your secret and set it to Expire In 1 year and then Add the secret.
You should see now under the client secrets section that there is an entry with the Secret Value. Go ahead and copy this value and store it in the file where you saved the Client ID and Tenant ID. This is the only time you will be able to see the secret value; it will not be recoverable if you don't copy it at this time.
  1. On the side pane, navigate to API Permissions > select Add a permission.
  1. When the screen shows, select Azure Rights Management Services. Then select Application Permissions.
  1. Click the drop down for Content and put checkmarks down for Content.DelegatedReader and Content.DelegatedWriter. Then at the bottom of the screen, click Add Permissions.
  1. Navigate back to the API Permissions section and add another permission.
  1. This time, for the Select an API section, click on APIs my organization uses. In the search bar, type in Microsoft Information Protection Sync Service and select it.
  1. Select Application Permissions and then in the Unified Policy drop down, checkmark the permission UnifiedPolicy.Tenant.Read. Then at the bottom of the screen, click Add Permissions.
  1. Back on the API Permissions screen, click Grant Admin Consent and look for the operation being successful (signified by a green checkmark).
The app registration process is now complete, and we can now get an authentication token on our server.
If your scanner service account has been granted the Log on locally right for the installation, sign in with this account and start a PowerShell session.
🚨
When the token expires, you must repeat this procedure.
Run Set-Authentication, specifying the application (client) ID, client secret value, and tenant ID that you saved in the previous steps:
Set-Authentication -AppId <ID of the registered app> -AppSecret <client secret string> -TenantId <your tenant ID> -DelegatedUser <Microsoft Entra account>
If the command runs successfully, you should see the message “Acquired access token.”
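
To avoid pasting the client secret into your command history, you can prompt for it as a secure string and convert it only at the moment it is passed to the cmdlet. This is a sketch under the assumption that `-AppSecret` expects a plain string, as the command above suggests. `ConvertTo-PlainText` is a hypothetical helper built on the .NET `NetworkCredential` class.

```powershell
function ConvertTo-PlainText {
    param([Parameter(Mandatory)][securestring]$SecureValue)
    # NetworkCredential exposes the decrypted value through its Password property.
    return [System.Net.NetworkCredential]::new('', $SecureValue).Password
}

# Usage on the scanner server (commented out: it prompts and needs the scanner module):
# $secret = Read-Host -Prompt 'Client secret value' -AsSecureString
# Set-Authentication -AppId '<ID of the registered app>' `
#     -AppSecret (ConvertTo-PlainText -SecureValue $secret) `
#     -TenantId '<your tenant ID>' `
#     -DelegatedUser '<Microsoft Entra account>'
```

The secret still exists in memory as plain text for the duration of the call, so this reduces exposure in history and transcripts rather than eliminating it.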

Running Scans via PowerShell and the UI

Now that we’ve finished the entire configuration process to get the scanner up and running, a node should now appear within the Purview portal:
Nodes are simply servers that have the PIP client and scanner service installed on them.
🚨
Nodes can only be a member of a single cluster in Purview. However, you can have multiple nodes within a single cluster.
Microsoft provides a full list of all PowerShell commands related to the PIP scanner found here: PurviewInformationProtection Module | Microsoft Learn
To verify the scanner is configured and working properly locally, run Start-ScannerDiagnostics with the -OnBehalfOf parameter:
$scanner_account_creds= Get-Credential
Start-ScannerDiagnostics -OnBehalfOf $scanner_account_creds
If successful, all checks should come back green which means we are ready to run scans.
We can also view our current content scan job settings by using the following command: Get-ScannerRepository
All settings shown here align with the configurations previously defined in the UI. These settings can be managed and modified through both PowerShell and the UI as needed.
To start scanning for sensitive data, go to your Content Scan Jobs > Select your scan job > Scan now
This will begin to scan your specified repositories on the server. To check the status of your current scan job, either refresh the scan job page and check the “Last scan end time” or head back to the server and enter the following command: Get-ScanStatus
The cluster status should display as “Scanning”, along with details such as the scan start time. When complete, the status will change from “Scanning” to “Idle”.
If you began a scan and you notice that it is stuck in “Idle” status, try the following:
  • Restart the Microsoft Purview Information Protection Scanner service within services.msc
💡
Restarting the service is also very helpful for pushing configurations that are not syncing locally to the scanner or if you notice other miscellaneous issues.
  • Update the scanner database by running the following command:
    Update-ScannerDatabase -Cluster <cluster name>
    • This command updates the database schema for the Microsoft Purview Information Protection scanner.
Once the scan is complete, we can begin reviewing the results and take appropriate action based on the findings.

Reviewing Scan Results

When the scan is complete, review the reports stored in:
%localappdata%\Microsoft\MSIP\Scanner\Reports
The Summary_<x>.txt file includes the time taken to scan, the number of scanned files, and how many files had a match for the information types.
In this scenario, “Enforce sensitivity labeling policy” is turned off, meaning any actions that would have occurred are reported in discovery mode only. This gives you a quick overview of the files scanned in your repository.
The .csv files have more details for each file. This folder stores up to 60 reports for each scanning cycle, and all but the latest report are compressed to minimize the required disk space.
  • Repository — File path of the scanned repository where the file was found
  • File Name — Name of the scanned file
  • Status — Whether the file was scanned successfully or failed
  • Comment — Optional field for notes while reviewing the report
  • Current Label — Current sensitivity label applied to the discovered file
  • Current Label ID — ID of the current sensitivity label applied to the discovered file
  • Applied Label — Label that was applied to the discovered file
  • Applied Label ID — ID of the label that was applied to the discovered file
  • Condition Name — Blank; purpose unclear / not documented
  • Information Type Name — Sensitive Info Type (SIT) detected in the file
  • Action — Action taken (for example, Classified means a label was applied)
  • Last Modified — Date the file was last modified in the file share
  • Last Modified By — User who last modified the file
  • Protection Before Action — Whether the file was protected (encrypted) before the action
  • Protection After Action — Whether the file is protected (encrypted) after the action
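
The detailed report can be summarized quickly with `Import-Csv` and `Group-Object`. The sketch below assumes the CSV headers match the column names listed above (“Information Type Name” and “Action”); check the header row of your own report and adjust if they differ. `Get-ScannerReportSummary` is a hypothetical helper, and a cell that lists several SITs is counted as a single combined value.

```powershell
function Get-ScannerReportSummary {
    param([Parameter(Mandatory)][string]$Path)

    $rows = @(Import-Csv -LiteralPath $Path)
    [pscustomobject]@{
        TotalRows  = $rows.Count
        # Count rows per detected information type, most frequent first.
        ByInfoType = @($rows | Where-Object { $_.'Information Type Name' } |
                Group-Object -Property 'Information Type Name' -NoElement |
                Sort-Object -Property Count -Descending)
        # Count rows per action taken, ignoring rows with no action.
        ByAction   = @($rows | Where-Object { $_.Action } |
                Group-Object -Property Action -NoElement |
                Sort-Object -Property Count -Descending)
    }
}

# Example usage on the scanner server: summarize the most recent detailed report.
# $reports = Join-Path $env:LOCALAPPDATA 'Microsoft\MSIP\Scanner\Reports'
# $latest = Get-ChildItem -LiteralPath $reports -Filter '*.csv' |
#     Sort-Object -Property LastWriteTime -Descending | Select-Object -First 1
# (Get-ScannerReportSummary -Path $latest.FullName).ByInfoType
```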
If your organization plans to label the sensitive content that was discovered on-premises, change “Enforce sensitivity labeling policy” to “On” and re-run the scan.
Once complete, all of the files containing sensitive info will be labeled and protected by Azure Rights Management / Information Protection labels.

Creating an On-Prem DLP Policy (optional)

Outside of discovering and labeling sensitive content on-premises, Purview also provides the ability to create DLP policies that can restrict access to, audit, or remove sensitive files. This helps enforce organizational data protection requirements and further prevent unauthorized data exposure.
Upon building a policy, we can select “On-premises repositories” as a location and specify whether we want to target All repositories, exclude any, or include specific ones.
When creating our rules, we can follow the same configuration outlined earlier in the article for auto-labeling to detect financial or PII data and restrict access to on-premises files containing them.
If needed, we can add Exceptions to the rule to not target files containing specific SITs, extensions, or document properties.
With our conditions set, we can now determine whether or not actions need to be applied when our rules are met.
💡
If you do not enforce any actions on the DLP policy, exfiltration involving sensitive files is still captured as an activity by Purview in “Audit” mode. These activities can be viewed by navigating to Data Loss Prevention > Explorers > Activity Explorer.
Restrict access or remove on-premises files enables enforcement actions on files located in:
  • File shares
  • On-prem SharePoint
Block people from accessing files stored in on-premises repositories options:
  • Block everyone
    • Removes access for all users with the exception of the file owner, last modifier, and local administrators.
  • Block only people who have access to your on-premises network and users in your organization who weren’t granted explicit access to the files
    • Removes access only for users who are not explicitly listed in the NTFS permissions
    • This affects users who inherit access broadly, such as through “Domain Users”
  • Set permissions on the file (permissions will be inherited from the parent folder)
    • Re-applies or enforces permissions based on the parent directory within On-Prem SharePoint
  • Move file to a quarantine folder
    • Relocates the file to a designated quarantine location in SharePoint or your file shares.
Unlike traditional policies, User notifications and overrides are not supported for on-premises locations:
Lastly, we can set a severity level for this rule for reporting and visibility. Depending on your configuration, you can set a level from Low to High accordingly.
“Send an alert to admins when a rule match occurs” does not send email notifications to your administrators; this is a common misconception.
Toggling this feature on simply sends matches to your DLP alerts dashboard in Purview for visibility.
You may also choose to alert every time an activity matches the rule, or only when a specified volume of activities is reached. To receive the most alerts, leave this setting at its default.
Additional options do not apply to on-premises DLP policies unless you have multiple rules in place and want to stop rule processing when an alert is generated.
Priority is only applicable if you have multiple rules within a single policy and you want to change the processing order.
Once all of our conditions, actions, and alerts have been set, we can turn the policy on and begin blocking unwanted activities on-premises.
Simulation mode is not supported at the time of writing this article. The policy can only be turned on or off.

Reviewing On-Premises Label Activities and Alerts

Within Data Loss Prevention or Information Protection, navigate to Explorers > Activity Explorer to begin reviewing the results of the PIP scanner.
The PIP scanner results appear under a handful of activities and searchable filters. These include:
  • Sensitivity label applied
  • Sensitivity label changed
  • Sensitivity label removed
  • Sensitivity label file read
  • Files discovered
  • Sensitivity label file renamed
  • File removed
  • Protection applied
  • Protection changed
  • Protection removed
  • Data loss prevention (DLP) policy matched
For matches from the scanner, the location should show as “Endpoint devices” and the “User” should be the service account. Activities related to your scans should appear as “Files discovered” and, if you enforced sensitivity labeling, “Label applied”. The filters at the top of the report are useful for narrowing down your matches.
Opening the “File discovered” activity shows details such as the name of the file, the SITs that were detected, the scanned repository, the file path of the source file, and the application used to discover it.
For “Label applied”, we get similar data with the addition of which sensitivity label was applied.
If you are looking for activities related to your On-Prem DLP Policy, you can add the “Policy” filter and sort by your policy’s name. Click on Add filter > Policy > Add
If your DLP policy does not appear under the filter, it’s because no matches have been detected yet and activities have not been logged.

Conclusion

Deploying the Purview Information Protection (PIP) Scanner is a practical way to extend your sensitivity labeling and DLP strategy to on-premises file shares and SharePoint, giving you visibility into where sensitive data lives and helping you apply consistent protection. With the right prerequisites in place (licensing, a properly permissioned service account, a supported SQL instance, and outbound connectivity), you can configure scanner clusters and content scan jobs to match your environment and goals.
Start in discovery mode to validate repositories, schedules, and information type detection (especially when using Policy only), and use scanner diagnostics and reporting to confirm results and tune scope before enforcing actions.
Once you are confident in the findings, enabling enforcement allows labels and, if desired, protection to be applied at scale, and optional on-prem DLP policies can further restrict access or quarantine high-risk content. From discovery to enforcement, you can reduce false positives, minimize operational risk, and roll out on-prem data protection in a controlled, auditable way.