From April 19th to April 22nd, the Einblick team gathered in the one and only Salt Lake City, Utah to participate in and celebrate the 20th Anniversary of PyCon. Python has long been the de facto language of data scientists and machine learning practitioners, and for good reason. So the team was excited to chat with Pythonistas and data scientists at our booth in the Expo Hall, and to spend some time in an environment made by and for this amazing community of contributors.
Being in the data space over the last year has meant that generative AI, large language models (LLMs), and natural language processing (NLP) have been top of mind. According to Ahrefs, as of May 2023, the term “openai” is searched about 212,000 times per month in the US alone. With this level of interest nationally and globally, it is unsurprising that generative AI and large language models were a key part of the experience at PyCon.
These topics were emphasized not only in the talks, but also in the pulse of the Expo Hall, where companies large and small touted their newest OpenAI integrations and innovations. Of course, we are among those most excited by the spark OpenAI ignited with DALL-E and ChatGPT. As a result, we wanted to highlight three talks and three companies that caught the team’s attention during PyCon.
Fairness, Bias, Privacy, and Agency
Almost overnight, it seemed that OpenAI and ChatGPT became daily topics of conversation in certain circles. With the onslaught of users came questions of use cases, validity, and of course skepticism, even fear. As with all cutting-edge technology, we as individual consumers and creators need to evaluate the technology and put appropriate safeguards in place to protect our users (and ourselves) from misuse of the latest and greatest.
In her talk, “Approaches to Fairness and Bias Mitigation in Natural Language Processing,” Angana Borah, a Master’s student in Computer Science at Georgia Tech, focused on how human-generated data contains human bias, which is then ported over into whatever model we train on it. She discussed different approaches to detect these biases and mitigate their effects. Fairness and bias have been hotly debated in machine learning for decades, and it’s encouraging to see these conversations continue, however exciting recent advances in the field have been.
In a similar vein, Cape Privacy, one of the sponsors of this year’s PyCon, introduced a system to help users balance the utility of ChatGPT and LLMs with maintaining privacy. Their talk was aptly named “The ChatGPT Privacy Tango.” In it, they addressed personally identifiable information (PII) and data security. This is especially important for people new to models like ChatGPT, and for vendors now building on OpenAI: users need to learn how to protect their data, know which third party is gaining access to what information, and understand what information these models are using.
Pivoting slightly, we were excited by Deepset’s talk, “Building LLM-based Agents,” which highlighted Agents in Haystack, their search framework built on large language models like GPT-4 (more on this later). As many of us have now experienced firsthand, ChatGPT’s results can be witty and authoritative, and sometimes very wrong. In part, this is because a model is only as good as the data it’s been trained on, and ChatGPT wasn’t built with any information beyond what was available online in 2021. Deepset’s Agents address this problem: the Agent framework lets an LLM react to a user request by determining which tool or knowledge base to call on. This creates “smarter” models, and ideally, better results.
This framework is relevant to a couple of recent academic papers that also focus on making language models more dynamic and customizable, since the use cases appear endless. For example, researchers from Meta AI Research and Universitat Pompeu Fabra proposed a model called Toolformer, which “learns to use tools in…self-supervised way without requiring large amounts of human annotations.” Their approach focused on teaching the model to make API calls. You can read more in their paper, “Toolformer: Language Models Can Teach Themselves to Use Tools.” Additionally, researchers from Princeton University and the Google Research, Brain team published a conference paper proposing a technique in which reasoning and acting in language models are interleaved rather than treated as independent processes. By combining the two processes, their approach, ReAct, outperformed other methods.
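The tool-dispatch idea behind these agent frameworks can be sketched in a few lines. The snippet below is purely illustrative: the rule-based `choose_tool` stands in for the LLM’s decision step, and all names are our own, not Haystack’s or Toolformer’s actual APIs.

```python
# Minimal sketch of an agent's tool-dispatch loop: pick a tool for the
# request, call it, and return the observation. A real agent would
# prompt an LLM to choose the tool and would feed the observation back
# to the model for further reasoning (as in ReAct).
from typing import Callable, Dict

# Registry of "tools" / knowledge bases the agent can call on.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr)),          # toy arithmetic tool
    "search": lambda q: f"(stub) top result for {q!r}",  # stand-in knowledge base
}

def choose_tool(request: str) -> str:
    """Stand-in for the LLM's decision: a real agent prompts the model
    to emit a tool name (and arguments) based on the request."""
    return "calculator" if any(ch.isdigit() for ch in request) else "search"

def run_agent(request: str) -> str:
    tool = TOOLS[choose_tool(request)]
    observation = tool(request)
    # A production agent would compose a final answer from this
    # observation; here we simply return it.
    return observation
```

Because the model (rather than hard-coded logic) decides which tool to invoke, the same loop can route arithmetic to a calculator and open-ended questions to a search index, which is what makes the pattern feel “smarter” than a single static prompt.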
Einblick Prompt: LLMs meet Data Science
These conversations and bodies of research are particularly exciting, as we are in the middle of developing and launching a new addition to the Einblick canvas: Einblick Prompt. With Prompt, users can build entire data workflows from a single sentence, or build on their existing work using natural language rather than fussing with tedious syntax. Prompt can create dynamic charts, regression models, and more. As long as you can verbalize the task, Prompt will build it for you inside our data science canvas. We believe these advancements can speed up work for data scientists, letting them focus on extracting insights and delivering recommendations. Our team accomplished this by leveraging context in LLMs and chaining prompts (reasoning) to create the best workflows for our users.
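Prompt chaining itself is a simple idea: the output of one prompt becomes context for the next, so a single user sentence can be expanded step by step into a full workflow. The sketch below uses a stub model and invented names; Einblick Prompt’s actual implementation is not public.

```python
# Illustrative prompt chain: thread each step's output into the next
# prompt template. The "llm" here is a stub that just upper-cases its
# input so the chaining behavior is visible.
def chain(llm, user_request, steps):
    """Run a list of prompt templates in order, threading context through."""
    context = user_request
    for template in steps:
        context = llm(template.format(context=context))
    return context

fake_llm = lambda prompt: prompt.upper()  # stand-in for a real model call

result = chain(
    fake_llm,
    "plot sales by region",
    ["plan the workflow for: {context}", "write code for: {context}"],
)
```

With a real model in place of `fake_llm`, the first step might produce a plan and the second turn that plan into code, which is the “reasoning” half of the chaining described above.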
Expo Hall: Meet the OpenAI Integrations
To close out our recap of PyCon, I wanted to highlight three companies that have developed OpenAI integrations.
Jina AI's Dev-GPT
Jina AI is a commercial open-source software company focused on providing multimodal AI services, from neural search to prompt engineering to cloud hosting. Jina AI currently has an experimental repo called Dev-GPT: Your Automated Development Team. The idea is an AI team made up of a virtual product manager, developer, and DevOps engineer that creates microservices automatically tailored to your needs.
You need to supply an OpenAI key, and you’ll pay OpenAI for the calls you make, so keep that in mind. To actually deploy the microservice your virtual “team” built, you’ll need a Jina account. Deployment also costs money, but you start out with some free credits. Some example microservices Jina AI provides are a compliment generator, chemical formula visualizer, product recommender, and meme generator.
You can check out more at their GitHub repo.
Deepset's Haystack
Deepset is an enterprise-ready ML/NLP platform. Their cloud features include custom NLP pipelines, prototyping, experiment tracking, deployment and monitoring, and data management. Originally released a few years ago, Haystack is an open-source framework that leverages the latest models, like BERT, Cohere’s models, and OpenAI’s models, as well as customizable databases, to build search systems that work over large document corpora. Supported tasks include question answering, retrieval, summarization, and reranking.
You can also use Haystack to build applications that answer complex customer queries via multi-step decision making. You can use existing models as-is, or fine-tune them for your unique data, then use feedback to evaluate, benchmark, and improve them. Check out more in their documentation.
Superblocks’ OpenAI integration
Superblocks is a programmable IDE built for developers. Their three main product offerings focus on building internal apps, workflows, and scheduled jobs. To enhance these, Superblocks has created a well-designed UI on top of the OpenAI API, so you don’t have to make any of the calls yourself, but can still use GPT-4 to transcribe files, generate emails, and refine anything generated by providing feedback.
Learn more in their blog post announcing the new integrations.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. "Toolformer: Language Models Can Teach Themselves to Use Tools." 2023.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. "ReAct: Synergizing Reasoning and Acting in Language Models." In The Eleventh International Conference on Learning Representations. 2023.
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.