In the current data science job environment, Data Engineers never get to work on scalability; they spend most of their time on ETL, data preparation and coding models. Similarly, Data Scientists never get to invent a new algorithm, because even minor tasks require Python and Scala skills and multiple design sessions with Data Engineers and DevOps teams. DevOps engineers, too, grow weary of being given nothing more to do than deploy the next Hadoop cluster through the Cloudera wizard and semi-automated Ansible scripts.
The Apache Spark/Hadoop ecosystem is great, but it is not yet stable or user-friendly enough to just run and forget. Data Engineers and Data Scientists should contribute to existing open source projects and create new tools to fill the gaps in day-to-day operations.
When Data Scientists code, they need to think not just about abstractions but also about practical issues, such as how long a query will run or whether the extracted data will fit into storage.
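One of the practical checks mentioned above, whether an extract will fit into storage, can be approximated before the query runs. The following is a minimal sketch, not a definitive method: the function names, the compression ratio, and the 20% headroom are illustrative assumptions, and real row widths would have to come from table statistics or a sample.

```python
def estimate_extract_bytes(row_count, avg_row_bytes, compression_ratio=1.0):
    """Rough extract size: rows x average row width, reduced by compression.

    All three inputs are assumptions the caller must supply, for example
    from table statistics or from measuring a small sample.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return int(row_count * avg_row_bytes / compression_ratio)


def fits_in_storage(estimated_bytes, free_bytes, headroom=0.2):
    """Return True if the extract fits while keeping a fraction of free space spare.

    headroom=0.2 (keep 20% of the free space unused) is an arbitrary
    illustrative default, not a recommendation.
    """
    return estimated_bytes <= free_bytes * (1 - headroom)
```

On a local disk, the `free_bytes` argument could be taken from `shutil.disk_usage(path).free`; on a cluster file system the equivalent figure would come from that system's own tooling.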
DevOps does not just mean writing Ansible scripts and installing Jenkins. DevOps needs to reduce handoffs and invent new tools that give Data Engineers and Data Scientists self-service.
An Ideal Real-Time Continuous Analytics Environment
- Data Scientists own the data project from the original idea all the way to production. They start from the business idea and work through data exploration and data preparation, then model development, then deployment and validation of the environment, and finally the push to production. Given the proper tools, a Data Scientist could run this complete iteration multiple times a day without relying on a Big Data engineer.
- Data Engineers work on scalability and storage optimization, developing and contributing to tools like Spark, enabling streaming architectures and so on.
- DevOps should give product engineers the models developed by Data Scientists as neatly bundled services, so that product engineers can build smart applications for business users.
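To make the idea of a model delivered as a bundled service more concrete, here is a small sketch using only the Python standard library. Everything in it is hypothetical: `predict` is a stand-in for a real trained model, and the JSON shape (`features` in, `score` out), the host, and the port are assumptions chosen for illustration. A production service would add validation, error handling and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = [0.5, -0.25]  # illustrative values, not a real model
    return sum(w * x for w, x in zip(weights, features))


def handle_payload(body):
    """Decode a JSON request body, score it, and encode a JSON response."""
    request = json.loads(body)
    score = predict(request["features"])
    return json.dumps({"score": score}).encode("utf-8")


class ModelHandler(BaseHTTPRequestHandler):
    """Expose the model over HTTP POST so applications need no model code."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(host="127.0.0.1", port=8080):
    """Start the service; blocks until the process is interrupted."""
    HTTPServer((host, port), ModelHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a product engineer would then only need to POST a body such as `{"features": [2, 4]}` to it. Keeping the scoring logic in `handle_payload`, separate from the HTTP handler, lets the Data Scientist test the model path without starting a server.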