Data Management for AI/ML Projects – A Project Management Perspective

 

Effective data management is crucial to model performance in AI/ML projects. A project manager must oversee several aspects of it, from data provisioning and data quality to the robustness and reliability of the data pipeline. This article looks at data management from the project management angle and relates its key aspects to successful, on-time project delivery.

Key Aspects of Data Management

1. Adequate Amount of Data

  • Data Provisioning: Extract data from multiple reliable sources, such as public datasets, proprietary databases, web scraping, and user-generated content.
  • Data Augmentation: Increase the dataset size with relevant data augmentation techniques where the available data is insufficient.
  • Historical Data: Use relevant and up-to-date historical data.
  • Data Storage: Ensure the infrastructure can store and process large datasets, potentially using cloud-based solutions.

2. Data Quality

  • Data Representation: Work with domain experts to profile the data, understand its composition, and confirm that it is representative of the use case.
  • Balancing the Dataset: Use techniques like SMOTE to oversample minority classes or random undersampling to reduce majority classes. Split the data into training, validation, and test sets with stratified sampling, and resample only the training set so that evaluation still reflects the real class distribution.
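
The balancing step above can be illustrated with a minimal sketch. It uses only the Python standard library and shows random undersampling, not SMOTE (which is available in the imbalanced-learn library). The function names and the toy "ok"/"fraud" labels are hypothetical, chosen for illustration only.

```python
import random
from collections import Counter, defaultdict


def class_distribution(labels):
    """Return the share of each class as a fraction of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: 90 "ok" records and 10 "fraud" records.
rows = list(range(100))
labels = ["ok"] * 90 + ["fraud"] * 10

print(class_distribution(labels))  # shows how skewed the raw data is
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))  # both classes now have 10 records
```

A report like the one from class_distribution is also a simple artifact a project manager can ask for early, since it makes an imbalance visible before any training starts.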

3. Quality Annotations

  • Annotation Guidelines: Provide clear and detailed guidelines and training for annotators.
  • Quality Control: Measure inter-annotator agreement, implement expert validation of a subset of annotations, and use automated checks to detect inconsistencies.
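
One common way to measure inter-annotator agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below is a minimal standard-library version; the function name and the sample labels are hypothetical, and a real project may prefer a library implementation or a multi-annotator measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same class independently.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for ten items; the annotators disagree on two.
annotator_a = ["cat"] * 5 + ["dog"] * 5
annotator_b = ["cat"] * 4 + ["dog"] * 5 + ["cat"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.6
```

Here the annotators agree on 8 of 10 items (0.8), chance agreement is 0.5, and kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6. Tracking this number per annotation batch gives the team an objective trigger for revisiting the guidelines.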

4. Data Pipeline

  • Data Cleaning: Handle missing values through imputation or removal and eliminate duplicate records.
  • Data Transformation: Apply normalization or standardization to numerical features, and encode categorical variables using techniques like one-hot or label encoding.
  • Data Split: A common starting point is an 80/20 split between training and test data, stratified so that the distribution of categories is maintained in both sets.
  • Cross-Validation: Use k-fold cross-validation to further evaluate model performance: divide the data into k subsets and train the model k times, each time holding out a different subset for validation.
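
The split and cross-validation steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a prescribed implementation: the function names, the fixed seed, and the toy labels are hypothetical, and libraries such as scikit-learn provide production-ready equivalents.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# Hypothetical toy labels: 90 "ok" records and 10 "fraud" records.
labels = ["ok"] * 90 + ["fraud"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20

# Fold indices are positions within the training set, not the full dataset.
for fold_train, fold_val in k_fold_indices(len(train_idx), k=5):
    print(len(fold_train), len(fold_val))  # 64 16 for each of the 5 folds
```

The test set is set aside once and never used during cross-validation, so the final evaluation remains independent of any tuning done on the folds.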

 

Poor data management impedes progress and puts the delivery of AI/ML projects at risk. Understanding the aspects above, and managing them properly, is one of the main success factors of an AI/ML project.

Issues related to data management are common in AI/ML projects. Project teams often lack sufficient data volume for model training. The data is frequently imbalanced or has many quality issues. In other cases, data annotation is done poorly or not at all. The project manager is expected to remove these blockers so that the project team can proceed with its data-related tasks. As the liaison between the project team and the stakeholders, the project manager is also responsible for ensuring that both sides know who is accountable for each issue and its corrective actions, and how much impact each issue has on the project.

Conclusion

Understanding the key aspects of data management is crucial for project managers to deliver AI/ML projects successfully. It takes an experienced project manager to identify the project risks related to data management at the outset of the project and to set mitigation plans for them. It also takes an effective project manager to manage the issues related to data management and track them to proper closure.

How do you manage your AI/ML projects in terms of data management? I welcome you to share your experience and thoughts in this area.
