What is LangChain? Why Use LangChain?

Paul Yang - March 31st, 2023

One sentence answer: It's an open-source library that equips developers with the necessary tools to create applications powered by large language models (LLMs).

What is LangChain?

LangChain is an open-source library that provides developers with the tools to build applications powered by large language models (LLMs). More specifically, LangChain is an orchestration tool for prompts, making it easier for developers to chain different prompts together into larger workflows.

LLMs (like GPT-3) provide a completion for a single prompt – you can think of it as getting one complete result for one request. For example, you could say "bake me a cake," and the LLM would produce a cake. You can also give a more complex command like "bake me a vanilla cake with chocolate frosting," and the LLM would likely return said cake as well.

But what if you instead asked: “give me the ingredients you need to bake a cake and the steps to bake a cake”? (A raw LLM handles this poorly in a single completion, but ChatGPT handles it well.)

To avoid having the user manually give each step and determine the order of execution, we can use LLMs to generate the next step at each point, using the outputs of the previous steps as context.

In short, LangChain is a framework that can orchestrate a series of prompts to achieve a desired outcome. It offers an easy-to-use way for developers to work with LLMs. In reductionist terms, LangChain is a wrapper for using LLMs.
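To make the "wrapper" idea concrete, here is a minimal sketch of a single prompt-plus-completion chain written against the early-2023 LangChain Python API; the prompt text is our own, and an OPENAI_API_KEY environment variable is assumed:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Assumes OPENAI_API_KEY is set in the environment
llm = OpenAI(temperature=0)

# A single templated prompt that can be reused with different inputs
prompt = PromptTemplate(
    input_variables=["dish"],
    template="List the ingredients needed to bake {dish}.",
)

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(dish="a vanilla cake with chocolate frosting"))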

Why Use LangChain?

LLMs are already incredibly powerful when used with a single prompt, but they execute completions by guessing the most likely next word, rather than reasoning as humans do.

Reasoning, by definition, is using existing knowledge to form new conclusions. We never consider “baking a cake” to be a single contiguous task, but rather a collection of smaller tasks, each of which influences the next.

LangChain is a framework that enables developers to build agents that can reason about problems and break them into smaller sub-tasks. With LangChain, we can introduce context and memory into completions by creating intermediate steps and chaining commands together.
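As an illustrative sketch of that chaining (again against the early-2023 LangChain API; both prompts are invented for this example), the first completion's output feeds directly into the second prompt, with no human shuttling text between steps:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

llm = OpenAI(temperature=0)

# Step 1: turn a goal into an ingredient list
ingredients_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["dish"],
        template="List the ingredients needed to bake {dish}.",
    ),
)

# Step 2: turn that ingredient list into ordered instructions
steps_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["ingredients"],
        template="Given these ingredients:\n{ingredients}\nWrite numbered steps for baking.",
    ),
)

# The first chain's output becomes the second chain's input automatically
overall = SimpleSequentialChain(chains=[ingredients_chain, steps_chain])
print(overall.run("a vanilla cake"))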

If I ask an LLM to tell me which stores were top-performing last week, it will generate a reasonable SQL query to pull my result, likely with fake but real-looking column names.

But with LangChain, we as developers can offer the LLM a choice of functions to use and ask it to compose a workflow. The agent can then actually work through the process and come back with a single answer: “store #1317 in New York is top performing.” The steps look something like this (a code sketch follows the list):

  • What do I need for a SQL query to get the top-performing store?
    • You can create a set of functions including getTables(), getSchema(table) etc…
  • I have some table names, how do I get table schema?
  • Here are a bunch of tableSchemas, which one contains sales by store?
  • How do I query sales by store to get top-performing stores given a table with this schema?
  • These are the top rows, which store was best performing last week?
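In LangChain terms, each of those helper functions becomes a "tool" that the agent can choose between at every step. Here is a rough sketch against the early-2023 agents API; get_tables, get_schema, and run_query are hypothetical stand-ins you would implement against your own warehouse:

from langchain.llms import OpenAI
from langchain.agents import initialize_agent, Tool

llm = OpenAI(temperature=0)

# Hypothetical helpers -- real versions would query your warehouse
def get_tables(_: str) -> str:
    return "sales_by_store, customers, products"  # stand-in catalog

def get_schema(table: str) -> str:
    return "sales_by_store(store_id INT, week DATE, revenue FLOAT)"  # stand-in

def run_query(sql: str) -> str:
    return "store_id=1317, revenue=52000"  # stand-in result rows

tools = [
    Tool(name="ListTables", func=get_tables,
         description="Lists all tables in the warehouse."),
    Tool(name="GetSchema", func=get_schema,
         description="Returns the schema for a given table name."),
    Tool(name="RunQuery", func=run_query,
         description="Runs a SQL query and returns the top rows."),
]

# The LLM decides which tool to call next, using prior outputs as context
agent = initialize_agent(tools, llm,
                         agent="zero-shot-react-description", verbose=True)
agent.run("Which store was top performing last week?")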

The extraordinarily cool part is that we rely on LLMs to generate each step and question, so there is no longer any need for a human to order these steps manually.

What’s So Exciting About LangChain?

At a high level, LangChain is exciting because it allows us to enhance already powerful LLMs with memory and context. We can introduce “reasoning” artificially and can solve more complex tasks with more accuracy.

LangChain is particularly exciting to developers because it provides a new way to build user interfaces. Instead of dragging and dropping or using code, users can simply ask for what they want.

For example, Microsoft Word has thousands of buttons, each mapping to some function. With frameworks like LangChain, we can combine the context of the words on the page, available functions, and our request to accomplish what we actually want to do. As a user, I never “want” to change the page number font size; I frequently have to, because buttons in Word represent atomic actions and the button to add page numbers adds it in Calibri 11pt. LangChain allows us to achieve goals directly, instead of performing atomic actions.

ChatGPT's success lies in the fact that it is not simply a naive implementation of GPT. Its answers are a product of feeding results back into itself several times. When a coding request is made, ChatGPT restates the request more formally, offers two implementations, describes the motivation for each, and explains the resulting code.

There is no way to explain code before it has been generated; the completion that explains the code must therefore have happened afterward, in a later pass.

How Are We Using LangChain?

LangChain can be particularly useful in the complex field of data science. Here are some reasons why data science is well-suited for LLM-augmented workflows:

  • There is a heavy reliance on different libraries, which can make remembering syntax challenging.
  • The focus is on the data scientist making inferences from outputs, not on the code itself.
  • There is a broad audience of stakeholders who need insights but may not have the skills to self-serve.

Not only does this save time, but it also allows proficient data developers to think at a higher level rather than staying in the weeds. It’s like having a junior data scientist on hand at all times to produce code and report the outputs back to you.

As an example, let’s say that we ran a marketing campaign, and we want to know the three most important drivers of some outcome.

An LLM does a great job producing boilerplate code, so we could take its output, edit it with some coding knowledge, and get a pretty good result by the end.

import snowflake.connector
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier
import shap

# Connect to Snowflake
conn = snowflake.connector.connect(
    user='<USERNAME>',
    password='<PASSWORD>',
    account='<ACCOUNT>',
    warehouse='<WAREHOUSE>',
    database='<DATABASE>',
    schema='<SCHEMA>'
)

# Pull data from Snowflake table into pandas dataframe
query = 'SELECT * FROM <TABLE>'
df = pd.read_sql(query, conn)

# Split into train/test sets
X = df.drop('target_column', axis=1)
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define XGBoost model and parameters for RandomizedSearchCV
xgb_model = XGBClassifier()

parameters = {
    'max_depth': range(3, 10),
    'learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2],
    'n_estimators': range(50, 200, 25),
    'gamma': [0, 0.1, 0.2, 0.3, 0.4],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9]
}

# Train XGBoost model using RandomizedSearchCV
xgb_random = RandomizedSearchCV(
    estimator=xgb_model, param_distributions=parameters,
    n_iter=50, cv=5, verbose=0, random_state=42, n_jobs=-1
)
xgb_random.fit(X_train, y_train)

# Get SHAP values
explainer = shap.Explainer(xgb_random.best_estimator_, X_train)
shap_values = explainer(X_test)

# Print top 3 most important features
shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=3)

However, some syntactic wrangling remains to fit the functions to our dataset, and a strong prerequisite of data science knowledge is needed to be successful. For instance, if there are text columns, I might have to ask the LLM to add encoding for them, since it has no context about my dataset.
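One common such edit (our illustration; the generated block above does not include it) is one-hot encoding text columns before the train/test split, continuing from the dataframe df pulled above:

# One-hot encode any text columns before modeling -- the LLM cannot
# know which columns are categorical without context about the dataset
# (assumes the target column itself is already numeric)
text_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=list(text_cols))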

In Einblick, however, in addition to raw Python, we have many functions and widgets. This means we can mix and match SQL and Python with pre-configured data connectors, helper functions for fetching table schemas, and advanced tools like AutoML.

So rather than wasting time editing a long block of completion code, we can rely on pre-existing components to end up with user-friendly outputs.

Complex questions will still require human experts. But most data questions do not call for deep learning models; frequently, they are straightforward factual requests that just happen to require intermediate knowledge mixing SQL and Python.

For example, a marketing director or product manager may need to know how many users logged in two weeks (or months) in a row. Most analytics tools cannot answer this out of the box, and it does not make sense to build a dashboard for every single possible metric.
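As a sketch of what that intermediate knowledge looks like (the column names user_id and login_date, and the CSV source, are assumptions for this example), the pandas version is short but not something most dashboards expose:

import pandas as pd

# Assumed input: one row per login event
logins = pd.read_csv("logins.csv", parse_dates=["login_date"])

# Collapse to one row per user per calendar week
logins["week_start"] = logins["login_date"].dt.to_period("W").dt.start_time
weeks = logins[["user_id", "week_start"]].drop_duplicates()
weeks = weeks.sort_values(["user_id", "week_start"])

# A user qualifies if two of their active weeks are exactly seven days apart
weeks["gap"] = weeks.groupby("user_id")["week_start"].diff()
n_users = weeks.loc[weeks["gap"] == pd.Timedelta(weeks=1), "user_id"].nunique()
print(n_users, "users logged in two weeks in a row")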

In summary, LangChain is an exciting development for developers who want to build applications powered by LLMs. It provides a new way to build UIs. Particularly for us, it simplifies the complexity of data science, making it easier for more people to access the power of data. By using LangChain as an orchestration tool, developers can take advantage of the power of LLMs to build the next generation of software.

Curious to see it in action? Get on the waitlist for Einblick Prompt here and see how LangChain can be applied to data science workflows.
