The Team Data Science Process (TDSP) is a framework developed by Microsoft that provides a structured methodology to build predictive analytics solutions and intelligent applications efficiently. This article outlines the key personnel roles and the associated tasks handled by a data science team that standardizes on this process. It also links to tutorials that explain how to set up the TDSP environment for the entire data science group, for data science teams, and for projects. The tutorials give detailed guidance using Visual Studio Team Services (VSTS) as the code-hosting platform and agile planning tool to manage team tasks, control access, and manage the repositories. You can use this information to implement TDSP on your own code-hosting and agile planning tools.
We have specified four distinct roles for our team personnel:
- Group Manager. The Group Manager is the manager of the entire data science unit in an enterprise. A data science unit might have multiple teams, each of which works on multiple data science projects in distinct business verticals. A Group Manager might delegate their tasks to a surrogate, but the tasks associated with the role do not change.
- Team Lead. A Team Lead manages a team in the data science unit of an enterprise. A team consists of multiple data scientists. For a data science unit with only a small number of data scientists, the Group Manager and the Team Lead might be the same person.
- Project Lead. A Project Lead manages the daily activities of individual data scientists on a specific data science project.
- Project Individual Contributor (Data Scientist, Business Analyst, Data Engineer, Architect, etc.). A Project Individual Contributor executes a data science project.
NOTE: Depending on the team, a single person may play more than one role, or more than one person may work in the same role.
The following picture depicts the top-level tasks for personnel by role in adopting and implementing the Team Data Science Process as conceptualized by Microsoft.
This schema and the following, more detailed outline of tasks that are assigned to each role in the TDSP should help you choose the appropriate tutorial based on your responsibilities in the organization.
[AZURE.NOTE] The following instructions show how to set up a TDSP environment and complete other data science tasks in Visual Studio Team Services (VSTS). We describe these tasks in terms of VSTS because that is what we use to implement TDSP at Microsoft. VSTS facilitates collaboration by integrating the management of work items that track tasks with a code hosting service used to share utilities, organize versions, and provide role-based security. You can choose another platform to implement the tasks outlined by the TDSP, but depending on the platform, some of the VSTS features we rely on may not be available. We also use the Data Science Virtual Machine (DSVM) on the Azure cloud as the analytics desktop; it comes with several popular data science tools pre-configured and integrated with various Microsoft software and Azure services. You can use the DSVM or any other development environment to implement TDSP.
The following tasks are completed by the Group Manager (or a designated TDSP system administrator) to adopt the TDSP:
- Create a group account on a code hosting platform (like GitHub, Git, VSTS, or others)
- Create a project template repository on the group account, and seed it from the project template repository developed by the Microsoft TDSP team. The TDSP project template repository from Microsoft provides a standardized directory structure, including directories for data, code, and documents, and a set of standardized document templates to guide an efficient data science process.
- Create a utility repository, and seed it from the utility repository developed by the Microsoft TDSP team. The TDSP utility repository from Microsoft provides a set of utilities that make the work of a data scientist more efficient, including utilities for interactive data exploration, analysis, and reporting, and for baseline modeling and reporting.
- Set up the security control policy of these two repositories on your group account.
For detailed step-by-step instructions, see Group Manager tasks for a data science team.
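To make the idea of a standardized directory structure concrete, here is a minimal Python sketch that scaffolds top-level folders for data, code, and documents. The folder names `Code`, `Data`, and `Docs`, the `.gitkeep` placeholder, and the function name are assumptions for illustration only; the actual layout is defined by the project template repository from the Microsoft TDSP team.

```python
import tempfile
from pathlib import Path

# Hypothetical top-level folders; the real TDSP template may use other names.
TEMPLATE_DIRS = ("Code", "Data", "Docs")


def scaffold_project(root: Path, dirs=TEMPLATE_DIRS) -> list[Path]:
    """Create the top-level folders under `root` and return their paths."""
    created = []
    for name in dirs:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created


if __name__ == "__main__":
    # Use a temporary directory so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold_project(Path(tmp) / "ProjectTemplate"):
            print(folder.name)
```

In practice you would seed the repository from the Microsoft template rather than build the folders by hand; the sketch only shows the kind of structure a seeded repository starts with.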
The following tasks are completed by the Team Lead (or a designated team project administrator) to adopt the TDSP:
- If VSTS is selected to be the code hosting platform for versioning and collaboration, create a team project on the group's VSTS server. Otherwise, this task can be skipped.
- Create the team project template repository under the team project, and seed it from the group project template repository set up by your group manager or the delegate of the manager.
- Create the team utility repository, and add the team-specific utilities to the repository.
- (Optional) Create Azure file storage to be used to store data assets that can be useful for the entire team. Other team members can mount this shared cloud file store on their analytics desktops.
- (Optional) Mount the Azure file storage to the Data Science Virtual Machine (DSVM) of the team lead and add data assets on it.
- Set up the security control by adding team members and configuring their privileges.
For detailed step-by-step instructions, see Team Lead tasks for a data science team.
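The optional mounting step can be sketched as follows, assuming a Linux DSVM and an SMB (CIFS) mount of the Azure file share. The function only builds the command; it does not run it. The storage account name, share name, mount point, and credentials file path are hypothetical placeholders, and the permissive `0777` modes are an assumption you may want to tighten.

```python
import shlex


def azure_file_mount_command(account: str, share: str, mount_point: str,
                             credentials_file: str) -> list[str]:
    """Build (but do not run) a Linux CIFS mount command for an Azure file share."""
    # Azure file shares are addressed as //<account>.file.core.windows.net/<share>.
    source = f"//{account}.file.core.windows.net/{share}"
    options = ",".join([
        "vers=3.0",                          # SMB protocol version
        f"credentials={credentials_file}",   # keeps the storage key off the command line
        "dir_mode=0777",                     # assumption: open permissions for a shared store
        "file_mode=0777",
        "serverino",
    ])
    return ["sudo", "mount", "-t", "cifs", source, mount_point, "-o", options]


if __name__ == "__main__":
    cmd = azure_file_mount_command(
        account="teamstorage",               # hypothetical storage account
        share="teamdata",                    # hypothetical share name
        mount_point="/mnt/teamdata",
        credentials_file="/etc/smbcredentials/teamstorage.cred",
    )
    print(shlex.join(cmd))
```

Printing the command rather than executing it lets a team lead review it first. On a Windows DSVM the share would be mapped differently, so treat this as one possible approach.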
The following tasks are completed by the Project Lead to adopt the TDSP:
- Create a project repository under the team project, and seed it from the Team project template repository.
- (Optional) Create Azure file storage to be used to store data assets of the project.
- (Optional) Mount the Azure file storage to the Data Science Virtual Machine (DSVM) of the Project Lead and add project data assets on it.
- Set up the security control by adding project members and configuring their privileges.
For detailed step-by-step instructions, see Project Lead tasks for a data science team.
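Seeding one repository from another, as the project repository is seeded from the team project template repository, can be done with a few Git commands. The sketch below is one possible approach, not necessarily the procedure used in the tutorials: it builds the commands without running them, assumes the target repository already exists and is empty, and uses hypothetical URLs and directory names.

```python
def seed_repo_commands(template_url: str, target_url: str, workdir: str) -> list[list[str]]:
    """Return the Git commands that copy a template repository into a new, empty one."""
    return [
        # 1. Get a local copy of the template, including its history.
        ["git", "clone", template_url, workdir],
        # 2. Point the local copy at the new (empty) repository.
        ["git", "-C", workdir, "remote", "set-url", "origin", target_url],
        # 3. Push the local branches to the new repository.
        ["git", "-C", workdir, "push", "-u", "origin", "--all"],
    ]


if __name__ == "__main__":
    # Hypothetical URLs; replace them with your team's repositories.
    commands = seed_repo_commands(
        template_url="https://example.com/team/TeamProjectTemplate.git",
        target_url="https://example.com/team/ChurnProject.git",
        workdir="ChurnProject",
    )
    for command in commands:
        print(" ".join(command))
```

To run the commands instead of printing them, each list could be passed to `subprocess.run(command, check=True)`. The same pattern applies one level up, when a team template repository is seeded from the group template repository.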
The following tasks are completed by a Project Individual Contributor (usually a Data Scientist) to conduct the data science project using the TDSP:
- Clone the project repository set up by the project lead.
- (Optional) Mount the shared Azure file storage of the team and project on their Data Science Virtual Machine (DSVM).
- Execute the project.
For detailed step-by-step instructions for onboarding onto a project, see Project Individual Contributors for a data science team.
By following the relevant set of instructions, data scientists, project leads, and team leads can create work items to track all tasks and stages that a project needs from its beginning to its end. Using Git also promotes collaboration among data scientists and ensures that the artifacts generated during project execution are version controlled and shared by all project members.
The instructions provided for project execution assume that both work items and project Git repositories are on VSTS. Using VSTS for both allows you to link your work items with the Git branches of your project repositories. In this way, you can easily track what has been done for a work item.
The following figure outlines this workflow for project execution using the TDSP.
The workflow includes steps that can be grouped into three activities:
- Sprint planning (Project Lead)
- Developing artifacts on Git branches to address work items (Data Scientist)
- Code review and merging branches with the master branch (Project Lead or other team members)
For detailed step-by-step instructions on project execution workflow, see Execution of data science projects.
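The second activity, developing artifacts on a Git branch for a work item, can be sketched as below. The branch naming convention (`workitem-<id>-<slug>`), the commit message format, and the function names are assumptions made for illustration; TDSP does not prescribe them here. The sketch only builds the Git commands and does not execute them.

```python
import re


def branch_name(work_item_id: int, title: str) -> str:
    """Derive a branch name from a work item (hypothetical naming convention)."""
    # Lowercase the title and collapse anything that is not a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"workitem-{work_item_id}-{slug}"


def work_item_commands(work_item_id: int, title: str, message: str) -> list[list[str]]:
    """Return the Git commands to do the work for one work item on its own branch."""
    branch = branch_name(work_item_id, title)
    return [
        ["git", "checkout", "-b", branch],        # create and switch to the branch
        ["git", "add", "--all"],                  # stage the new or changed artifacts
        ["git", "commit", "-m", f"{message} (work item #{work_item_id})"],
        ["git", "push", "-u", "origin", branch],  # publish the branch for code review
    ]


if __name__ == "__main__":
    for command in work_item_commands(42, "Explore customer churn data",
                                      "Add data exploration report"):
        print(command)
```

After the branch is pushed, the Project Lead or another team member reviews the changes and merges the branch into the master branch, which corresponds to the third activity in the list.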