Introduction
Data scientists undergo a long journey to complete a project, switching between many tools and working with many stakeholders in order to achieve one end result. Throughout this process, unexpected roadblocks from uncontrollable factors may pop up. Teams may not have enough resources to meet demand. Communication may break down along the way. And in unfortunate cases, the end results may not convert into anything substantial if stakeholders do not understand and support the project.
But just what are the top pain points today? To answer that question, Einblick conducted a survey to understand what challenges data teams face. Our survey asked data scientists and managers across different industries a series of questions about their current tools, challenges, and priorities, and we have highlighted some of those findings here.
Communication issues on two fronts
Undefined project goals were, by a large margin, the biggest complaint among data scientists, with 75% ranking them as a frequent problem. In addition, ~55% of respondents cited a lack of buy-in from non-technical stakeholders as a further challenge. Both challenges concern communication between data scientists and managers or stakeholders, and both have been exacerbated by the rise of remote work during the pandemic.
The root cause? Data science platforms lack ways for teams to communicate ideas and insights easily and effectively. During initial planning meetings, discussing project goals is important so that everyone is on the same page. Later in the process, sharing draft results, discussing methodology, and quickly iterating on stakeholder feedback become important. But Zoom can only do so much, and live collaboration and whiteboarding features are uncommon in data science tools.
Separate tools exist for video conferencing, commenting, and sharing files among teams, but none combines all of these on the same platform. Switching back and forth between tools is inefficient and may require keeping multiple apps open at once. Indeed, our survey found that ~60% of respondents use PowerPoint to collaborate with non-technical stakeholders. The next most common answer was ‘written reporting/documentation’, cited by ~25% of respondents as their medium for communication. Together, that means roughly 85% of collaboration in data science is functionally done in Microsoft Office.
Moreover, converting code and computational notebooks into presentation slides is rarely a data scientist’s favorite activity, and it typically involves copying and pasting graphs. The data scientist’s work and the presentation live on two unconnected platforms, leaving room for missing information. Data scientists must also flip back and forth between their work and the presentation, which becomes inefficient when stakeholders ask to see areas that were not included in the slides.
Many roadblocks at the beginning of the data science process
Our survey also presented a list of challenges for data scientists and asked respondents to rate how often their team faced each one on a scale from 1 to 3 (least often to most often; the same rating could be given to more than one challenge). We found that the most frequent challenges occur before any coding even happens. About half of respondents flagged each of “low overall bandwidth for new projects” and “unavailable data.”
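To make the rating scheme concrete, here is a minimal Python sketch of how non-unique ratings on a 1 to 3 scale could be tallied into a "share of respondents" figure. The responses, the shortened challenge labels, and the `share_rated` helper are all hypothetical and purely illustrative; they are not the survey data or the method used to analyze it.

```python
from collections import Counter

# Hypothetical responses (NOT the survey data). Each respondent rates every
# challenge from 1 (least often) to 3 (most often). Ratings are non-unique,
# so one respondent may give several challenges the same score.
responses = [
    {"low bandwidth": 3, "unavailable data": 3, "unwieldy dataset size": 1},
    {"low bandwidth": 3, "unavailable data": 2, "unwieldy dataset size": 3},
    {"low bandwidth": 1, "unavailable data": 3, "unwieldy dataset size": 2},
    {"low bandwidth": 2, "unavailable data": 1, "unwieldy dataset size": 1},
]


def share_rated(responses, rating=3):
    """Return, per challenge, the fraction of respondents who gave `rating`."""
    counts = Counter()
    for response in responses:
        for challenge, score in response.items():
            if score == rating:
                counts[challenge] += 1
    # Collect every challenge so that ones nobody rated `rating` show up as 0.
    challenges = sorted({c for response in responses for c in response})
    total = len(responses)
    return {c: counts[c] / total for c in challenges}


for challenge, share in share_rated(responses).items():
    print(f"{challenge}: {share:.0%} rated it 'most often'")
```

Because ratings are not forced to be unique, the shares for different challenges do not need to sum to 100%.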
Many external factors affect both the capacity to tackle new projects and the availability of data. Chiefly, data science is an in-demand profession, and a shortage of data science personnel within a company or department places more pressure on the existing data scientists. Some data analytics and data science tools do help non-technical domain experts take on some of the data responsibilities and streamline parts of the process; however, cost and use case determine whether a team will adopt another application in addition to what it already uses.
Even with enough people, teams still need relevant data, which is not always easy to find. For example, there might be a huge gap between what the data scientist knows about the data and what a domain expert knows. Other times, private data may require clearing hurdles and obtaining permissions before it can be accessed.
Finally, when data scientists do get the data, the datasets may be extremely large, making them difficult to handle and analyze. A further 35% of respondents marked “unwieldy dataset size” as a roadblock. Some teams have the luxury of advanced compute environments (which can come at a steep cost), while other teams must struggle and homebrew their way past kernel crashes.
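One common homebrew workaround for datasets that do not fit in memory is to stream the file and keep only running totals, rather than loading everything at once. The sketch below is a generic illustration using Python's standard library; the `streaming_mean` helper and the sample data are hypothetical, and this is not a description of how any particular tool handles large datasets.

```python
import csv
import io


def streaming_mean(lines, column):
    """Compute the mean of one numeric column, reading one row at a time."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:
        # Only the running sum and count are kept, never the full dataset.
        total += float(row[column])
        count += 1
    return total / count if count else None


# Tiny in-memory stand-in for a large CSV file (hypothetical data).
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
print(streaming_mean(sample, "value"))  # prints 20.0
```

In practice, `lines` would be an open file handle, so memory use stays roughly constant regardless of file size. The trade-off is that only statistics expressible as running aggregates can be computed this way.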
Conclusion
Our survey found that many roadblocks for data scientists occur at the beginning of the data science process and when information is communicated to non-technical stakeholders. While not all of these issues are easily solved, some tools can ease certain communication and computational challenges.
Einblick offers live collaboration and whiteboarding in a shared workspace, all in the same application, which also provides quick model feedback. The main features that mitigate the issues discussed above are:
- Multiplayer mode allows for group participation by the domain expert, the data engineer, and the main data scientist
- Progressive computation engine handles large datasets up to 10TB with no setup and no special code
- The platform enables data scientists to be quick and flexible, working through multiple processes at once, saving time and opening up space to handle more projects
Ultimately, our survey found that data teams face obstacles in their workflow when defining project goals, acquiring relevant data, processing large amounts of data, and garnering stakeholder buy-in. While many data tools exist, data teams need a tool or tools that address these key obstacles and actually make life easier.