Data Profiles

Do you know what is going on in your data pipelines? Data tests give some context on potential issues with the data in your pipelines, and data profiles give even more. Panda Patrol lets you profile your data at each step of a pipeline and track those profiles over time: it stores data profiles in a database and provides a dashboard for viewing them. At any point in time, you can see what your data looked like and quickly debug data issues.

Basic Data Profile

The panda_patrol library provides a basic_data_profile method that uses ydata-profiling to generate a basic data profile and store it (see Storing Data Profiles below). This method takes the following parameters:

df: pd.DataFrame - Dataframe to profile.
patrol_group: str - Name of the patrol group. See Patrol Groups and Patrols for more information.
patrol: str - Name of the patrol. See Patrol Groups and Patrols for more information.

The following example uses Prefect for data pipeline orchestration.

from io import StringIO

import pandas as pd
import requests
from prefect import flow, task
from panda_patrol.profilers import basic_data_profile

@task(retries=3)
def fetch_data():
    """
    Fetch the Titanic dataset using requests.
    """
    url = (
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    )
    response = requests.get(url)
    data = pd.read_csv(StringIO(response.text))
    basic_data_profile(data, "Titanic Dataset", "Profiling Report")

    return data

@flow(name="Titanic", log_prints=True)
def run_titanic_analysis():
    """
    Fetch and transform the Titanic dataset.
    """
    fetch_data()


if __name__ == "__main__":
    run_titanic_analysis.serve(name="run_titanic_analysis")
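To profile the output of every pipeline step without repeating the call inside each function, one option is a small decorator. The sketch below is an illustration, not part of panda_patrol: `with_profile` is a hypothetical helper, and it only assumes that the profiling function takes `(data, patrol_group, patrol)` in that order, as basic_data_profile is described above. A recording stub stands in for the real profiler so the example runs without a Panda Patrol backend.

```python
import functools
from typing import Any, Callable


def with_profile(
    patrol_group: str,
    patrol: str,
    profile_fn: Callable[[Any, str, str], Any],
):
    """Hypothetical helper: profile a step's return value after the step runs."""

    def decorator(step: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(step)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = step(*args, **kwargs)
            # Hand the step's output to the profiler, then pass it through unchanged.
            profile_fn(result, patrol_group, patrol)
            return result

        return wrapper

    return decorator


# Stand-in for basic_data_profile so the sketch runs on its own.
recorded = []


def record_profile(data: Any, patrol_group: str, patrol: str) -> None:
    recorded.append((patrol_group, patrol, data))


@with_profile("Titanic Dataset", "After fetch", record_profile)
def fetch_rows():
    return [{"Name": "Braund", "Age": 22}, {"Name": "Cumings", "Age": 38}]


rows = fetch_rows()
```

In a real pipeline you would pass basic_data_profile in place of `record_profile` and give each step its own patrol name under a shared patrol group.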

Storing Data Profiles

The panda_patrol library provides a save_report method that can be used to store data profiles in a database. This method takes the following parameters:

report_string: str - String of the data profile report that was generated.
patrol_group: str - Name of the patrol group. See Patrol Groups and Patrols for more information.
patrol: str - Name of the patrol. See Patrol Groups and Patrols for more information.
report_format: str - Format of the report. Supports 'html', 'json', or 'image'.

The save_report method stores the data profile report in the database, and users can then interact with the report on the frontend. The library comes with a basic data profiler, but you can use any data profiler you want. Some popular data profilers include:

DataProfiler

The following example uses Prefect for data pipeline orchestration.

from io import StringIO

import pandas as pd
import requests
from prefect import flow, task
from ydata_profiling import ProfileReport
from panda_patrol.profilers import save_report

@task(retries=3)
def fetch_data():
    """
    Fetch the Titanic dataset using requests.
    """
    url = (
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    )
    response = requests.get(url)
    data = pd.read_csv(StringIO(response.text))

    # Generate a data profile report using ydata-profiling
    report = ProfileReport(data, title="Titanic Dataset")
    # Save the report to Panda Patrol
    save_report(report.to_html(), "Titanic Dataset", "Profiling Report", "html")

    return data

@flow(name="Titanic", log_prints=True)
def run_titanic_analysis():
    """
    Fetch and transform the Titanic dataset.
    """
    fetch_data()


if __name__ == "__main__":
    run_titanic_analysis.serve(name="run_titanic_analysis")

This generates the following dashboard:

[Image: profiling-dashboard]
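Because save_report takes the report as a string, a hand-written profiler can also produce one. The sketch below is an assumption-laden illustration rather than anything shipped with panda_patrol: `profile_csv_text` is a hypothetical function that uses only the standard library to summarize CSV text into a JSON string. The source does not say what JSON structure the dashboard expects, so treat the layout of the output as a placeholder.

```python
import csv
import json
import statistics
from io import StringIO


def profile_csv_text(csv_text: str) -> str:
    """Hypothetical profiler: summarize CSV text as a JSON string."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    profile = {"row_count": len(rows), "columns": {}}

    for name in columns:
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        stats = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }

        # Treat the column as numeric only if every non-missing value parses as a float.
        numeric = []
        for value in present:
            try:
                numeric.append(float(value))
            except ValueError:
                numeric = None
                break

        if numeric:
            stats["min"] = min(numeric)
            stats["max"] = max(numeric)
            stats["mean"] = statistics.fmean(numeric)

        profile["columns"][name] = stats

    return json.dumps(profile, indent=2)


sample = "Name,Age\nBraund,22\nCumings,38\nHeikkinen,\n"
report_string = profile_csv_text(sample)

# Assumption: the string can then be stored with report_format="json", e.g.
# save_report(report_string, "Titanic Dataset", "Custom Profile", "json")
```

The save_report call is left as a comment so the sketch runs without a Panda Patrol backend; the patrol name "Custom Profile" is made up for the example.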
