Data science is an art form

Paul Yang - February 28th, 2023

Jonah Hill and Brad Pitt in Moneyball (2011)

As data professionals, it is our job to provide answers to business questions that can be confusingly opaque. From my own experience, hopefully mirroring yours as well, it is clear that data science is a creative design profession.

As one example, exploratory data analysis (EDA) is the foundation of all data projects. Exploration inherently implies ambiguity and uncertainty, which are part of all artistic endeavors. Artists often start writing, painting, and creating with just an idea and the commitment to pursue it. By embracing data science as an art form, we can take the best of both art and science to push what is possible in the field. Here’s why, and what we as data scientists have to learn from designers and the visual arts.

Creativity is meandering

Person standing at a fork in the road

Art is not produced by a linear process. This may be obvious, but every good design process starts with a stage of ambiguity. Initial ideas are explored by the designer – and most initial ideas are bad. Even the Mona Lisa may have changed dramatically between the first draft and the final product. But every dead end also yields fresh ideas, which the artist then needs time to explore thoroughly.

In a technical field like data science, it can be easy for stakeholders to prioritize measurable results over all else. But as data scientists, we should embrace ambiguity, and advocate for more resources (i.e., time) to be put towards discovery work. Many articles, maybe written by impatient Harvard Business School managers, warn about “analysis paralysis” – when someone analyzes a situation so extensively that decisions are not made in a timely fashion. And even if data scientists do not face stakeholder pressure, processes like exploratory data analysis can feel mundane, and sometimes endless. We pull at different threads, but none in particular seems to bear fruit. But to take a step back, and to push back, creativity is not linear.

We must proactively budget time to simply marinate in data without making visible progress. Every data project has ambiguity, and ambiguity about goals or deployment methods can only be clarified with stakeholders once you have built up context. There is also usually some missing context or faulty assumption that is only detectable once you have built intuition about a dataset. Finally, rigorous methodology is not trivial – sometimes, picking the right train-test split takes a moment of careful thought (it's not always a random split!). Only within this non-linear exploration can great data science be produced.
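To make the train-test split point concrete, here is a minimal sketch in plain Python. It assumes a made-up table of daily sales, and the function names (`chronological_split`, `random_split`, `leaks_future`) are hypothetical helpers written for this illustration, not part of any library. For time-ordered data, holding out the most recent days keeps the model from "seeing the future" during training, which a shuffled split can allow.

```python
import random
from datetime import date, timedelta


def chronological_split(rows, test_fraction=0.2, key="day"):
    """Sort rows by time and hold out the most recent ones as the test set."""
    ordered = sorted(rows, key=lambda r: r[key])
    cut = len(ordered) - int(round(len(ordered) * test_fraction))
    return ordered[:cut], ordered[cut:]


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows, then hold out the last chunk as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(round(len(shuffled) * test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaks_future(train, test, key="day"):
    """True if any training row is dated after the earliest test row."""
    earliest_test = min(r[key] for r in test)
    return any(r[key] > earliest_test for r in train)


# Hypothetical data: ten consecutive days of sales.
start = date(2023, 1, 1)
rows = [{"day": start + timedelta(days=i), "sales": 100 + i} for i in range(10)]

train, test = chronological_split(rows)
print(len(train), len(test))      # 8 2
print(leaks_future(train, test))  # False

# A random split may or may not leak, depending on the shuffle.
r_train, r_test = random_split(rows)
print(leaks_future(r_train, r_test))
```

Whether the chronological split is the right choice depends on how the model will be used; the point is only that the decision deserves a moment of thought.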

Creative projects require group buy-in

Fine art aside, design products frequently require the commentary and support of non-designers. Product designers collaborate with product managers; architects answer to their clients. In the end, despite the majority of stakeholders having very little artistic capability themselves, they provide invaluable (even if sometimes un-valuable) input which is required for success. This ranges from context on the high-level motivation, to feedback on feasibility, down to gut-feel reactions based on “look-and-feel.”

But, fundamentally, people will not patronize a creative endeavor that is not understood. In the case of data science, people will not implement data-driven solutions that they do not understand or believe in.

As data scientists, we generally put insufficient focus on collaboration. The problem originates from the lack of a shared language. For design and UX, most product managers have at least some sense of the output that is desired. For data scientists, stakeholders frequently do not share the vernacular – even what “accuracy” means (vs. F1 score) can be ambiguous. But a data science project fails without stakeholders as quickly as an app redesign without product manager buy-in. This lack of stakeholder dialogue is why data science projects languish in states of complete-not-deployed or didn’t-change-anything.
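The accuracy versus F1 ambiguity is easy to show with a toy case. The sketch below uses plain Python and an invented churn dataset (95 customers who stay, 5 who churn); the two "models" and their predictions are made up for illustration. Both reach the same accuracy, yet only one of them ever finds a churner.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 0 = stays, 1 = churns.
y_true = [0] * 95 + [1] * 5

# Model A always predicts "stays".
always_stay = [0] * 100

# Model B raises 4 false alarms but catches 4 of the 5 churners.
careful = [1] * 4 + [0] * 91 + [1] * 4 + [0] * 1

print(accuracy(y_true, always_stay), f1_score(y_true, always_stay))  # 0.95 0.0
print(accuracy(y_true, careful), round(f1_score(y_true, careful), 3))  # 0.95 0.615
```

A stakeholder who hears "95% accurate" for both models would have no way to tell them apart, which is exactly the kind of gap a shared vocabulary closes.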

Data scientists should prioritize winning commitments from stakeholders to actually learn some data science concepts. Persistence here is vital – hold people’s feet to the fire, and if they don’t want to learn, convince their boss.

Inadequate software can break your team

Unlike a Word document, which can be easily shared between teammates and also with stakeholders, design products have historically been hard to share. Creative software is expensive, whether Photoshop, film editing, CAD, or otherwise, and is rarely available to stakeholders. These tools also have so many buttons that only an expert in that software would be able to use them. This introduces a real barrier between the designer and any consumer of their designs.

But creative outputs still have two advantages over data science outputs. First, as noted above, the end product is immediately understandable by stakeholders, even if the intermediate product is not. A video, a UX design, a board of proposed ads – all of these are easier to consume than a pickled XGBoost model. Second, the software to view these output objects is typically pre-installed, whereas it’s not clear how to natively view data science outputs as a non-data scientist (PowerPoint?). And to a great extent, the success of modern tools like Figma shows design software’s movement towards providing a shared space with no environmental setup.

So data scientists need to find a way to get out of Jupyter notebooks into collaborative spaces. One team we have seen uses a collaborative whiteboard tool; another team commits to long Zoom calls. Modern reinventions of Python notebooks, like Einblick, are innovative here, by being cloud-based and having a collaborative presentation layer.

Develop psychic powers (sort of)

Two people stand, stargazing

There is one caveat to collaboration and sharing. Steve Jobs is famous for rejecting the notion that Apple should provide customers what they want, instead opining that

“Our job is to figure out what [customers] are going to want before they do.”

Jobs also echoes a line often attributed to Henry Ford: “If I had asked people what they wanted, they would have said faster horses.” An important distinction is that good design is not the product of consensus, but the distillation of common opinion. Conversation helps identify the ingredients and sell the end product but is not a replacement for individual creative intuition.

Most end consumers can only expect and ask for what they already know (or better versions of what they know), but that may not be the correct answer for a particular problem. Innovative designs, like the iPhone, may radically redraw the product form factor, change the interaction model, and/or revolutionize the user interface.

Similarly, for data scientists, the best output may not be the original ask. Does a project need a simple dashboard chart to be updated, or should we build a predictive model? Do we need to serve results from an endpoint, in a spreadsheet, or in a database table? Are we planning to do a forecast, when the right approach would be a classification exercise? Don’t draw what other people say they want; infer what they really want.

Conclusion

In short, data science is both a technical profession and a creative craft. The success of a project is not necessarily dependent on technical rigor alone, just as artists do not become successful just through technique. As artists, we must invest time into creativity, discovery, and selling our vision.

About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.