Data Science Can Be More Accessible – But Training, Process, and Tools All Play a Part

Last week I attended the first Citizen Data Science Summit, hosted by MIT. The conference was a great success, with over 850 people participating virtually and in person to discuss wide-ranging topics, from the state of business intelligence tools to how data visualizations shaped public perception of COVID-19. As a software engineer who helps build Einblick to empower citizen data scientists with collaborative tools for data visualization and exploration, I wanted to share the topics that stood out to me and that I’d like to investigate further.

Conference highlights and insights

  • Data as a product – When derived data lives in a system like Einblick, it’s important to provide users with a means of documenting, packaging, and publishing that data so it is more accessible to other users. Annotations and captions are helpful, but they do not replace features that enable robust documentation for datasets. Treating data as a product can break down “knowledge silos” and reduce communication overhead when data is collected and shared.
    Relevant Talks: Anthony Deighton, CPO, Anhai Doan, Professor
  • Testing & validation of data (and models) – Following the last point, many talks emphasized the importance of testing data – essentially, ensuring your assumptions about the data and its quality are correct. The same can be said for machine learning models. Working on a platform that regularly combines multiple datasets and acts as the place where subject matter experts explore the data with their technical counterparts, I found that this idea of data validation and provenance resonated. It can definitely cut down on unexpected and hard-to-debug errors that are ultimately problems with the data itself.
    Relevant Talks: Detlef Nauck, BT, Arvind Satyanarayan, Professor
  • Ensuring models are simple – This statement is debatable when considering all of data science, but the premise is in line with principles like Occam’s razor, and I still found it quite convincing. Often, insights derived from an understanding of the model are more valuable than a powerful and accurate model itself. Thus, there’s a focus on better data and easy-to-understand models.
    Relevant Talks: Einblick Demo, Ben Lorica, Gradient Flow 
  • Emphasis on data literacy education – Every talk from enterprise data science leaders mentioned using various resources to directly educate workers, and the lengths they go to in order to accomplish this – specifically, teaching everyone how to frame and understand problems the way a data scientist would. Businesses and organizations need tools to introduce data literacy concepts to non-technical domain experts and to build on them. One obvious opportunity for analytics software developers is to support the ongoing education of their users, not just in learning the tool, but in data science itself.
    Relevant Talks: Natalie Morse, Torqata, Stefan Langenbach, Covestro
  • Before & after visualization – When creating a tool to clean or enrich data, it is useful to show the data before and after, side by side. This highlights the changes made and demonstrates the tool’s usefulness.
    Relevant Talks: Eugene Wu, Professor, Mike Cafarella, Principal Researcher
  • Enriching data using “publicly” available data – aka “batteries included”. Make commonly used datasets available in a clean, documented way. Don’t leave every user or team to reinvent the wheel.
    Relevant Talks: Tamr Demo, Michel Tricot, CEO Airbyte, Anthony Deighton, CPO
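The “testing your data” idea above can be sketched in a few lines: encode your assumptions about a dataset as explicit, named checks that run before any analysis, so data problems surface immediately instead of as hard-to-debug errors downstream. This is a minimal illustration; the sample records, check names, and `validate` helper are hypothetical, not part of any tool discussed at the summit.

```python
# Minimal sketch of data validation: express assumptions about the data
# as named predicate checks, then report every row that violates one.

def validate(rows, checks):
    """Run each named check against every row; return failures as
    {check_name: [offending row indices]}."""
    failures = {}
    for name, predicate in checks.items():
        bad = [i for i, row in enumerate(rows) if not predicate(row)]
        if bad:
            failures[name] = bad
    return failures

# Hypothetical sales records to validate.
rows = [
    {"region": "EMEA", "units": 120, "price": 9.99},
    {"region": "APAC", "units": -5, "price": 14.50},  # negative units
    {"region": None, "units": 40, "price": 7.25},     # missing region
]

# Assumptions about the data, stated explicitly.
checks = {
    "region_present": lambda r: r["region"] is not None,
    "units_non_negative": lambda r: r["units"] >= 0,
    "price_positive": lambda r: r["price"] > 0,
}

failures = validate(rows, checks)
# failures == {"region_present": [2], "units_non_negative": [1]}
```

In practice, purpose-built validation libraries add richer checks (schemas, distributions, provenance), but the core pattern is the same: make assumptions executable and run them every time the data changes.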

In summary, I was really excited to hear from leaders and organizations that are putting Citizen Data Science at the forefront, and to learn how they’re thinking not just about the bleeding edge of machine learning and data science, but about embedding analytics into every employee’s workflow in a way that creates long-lasting flywheel effects.

If you want to learn or see more, don’t hesitate to watch all the talks yourself: