Data Science Workflow
This repo contains resources, tutorials, and general best practices that our team has developed, and continues to refine, for collaborating effectively as a group of data scientists. Our team relies heavily on the Open Data Hub (ODH) project (which we also recommend), so many of our examples use that toolbox. However, we hope there is enough generally applicable content here to be useful to those outside the ODH ecosystem. We also invite others to contribute their own best practices and implementation alternatives. :)
Set Up Your Environment
- What is the Open Data Hub and Operate First?
- Access JupyterHub
- Manage data with remote storage
- Set up CI
Develop + Collaborate
- Use Thoth tooling to enhance development
- Best practices for contributing as a Data Scientist
- Build ML pipelines from notebooks
- Track your metrics and experiments
- Share reproducible notebook images
- Quickly deploy interactive environments and Jupyter Books
- Create interactive dashboards to visualize results
- Getting Started with Data Science
Serving + Monitoring
- Monitor JupyterHub environment workloads
- Serve your model with Seldon
- Create custom serving images with Seldon
Project Organization + Structure
- Tips for starting a new ML project from scratch
- Recommendations for structuring an E2E ML project
- Template for writing a project document
- Simplify project management with GitHub project boards and issues
E2E Example Projects
Contact
This project is maintained as part of the Operate First and Emerging Technologies group in Red Hat’s Office of the CTO. More information can be found at https://www.operate-first.cloud/.
Have a question? Open an issue in this repo or join us on Slack in the #data-science channel. :)