Python Code Examples

Einblick Content Team - June 5th, 2023

One of the preprocessing functionalities that Keras provides for natural language processing (NLP) is tokenization. This post will provide a quick start for Keras' Tokenizer() class.

Einblick Content Team - June 1st, 2023

Scikit-learn's TF-IDF Vectorizer (Term Frequency - Inverse Document Frequency) turns raw documents into a matrix of TF-IDF features. This process combines the CountVectorizer and TF-IDF Transformer.

Einblick Content Team - May 24th, 2023

Part of natural language processing is determining the role of each word or token in a body of text. In the world of NLP, we call this process part-of-speech (POS) tagging. In this post we'll review the POS tagging function in NLTK called pos_tag().

Einblick Content Team - May 17th, 2023

As in our prior post, which focused on tokenization in NLTK, we'll do a similar walkthrough for spaCy, another popular NLP package in Python. We'll go through a few different ways you can tokenize your text, as well as additional commands you can use to get more information about each token.

Einblick Content Team - May 10th, 2023

Tokenization is the process of breaking up text into smaller units that can be more easily processed. In this post we'll review two functions from NLTK, word_tokenize() and sent_tokenize(), so you can start processing your text data.

Einblick Content Team - May 10th, 2023

Removing stop words is an important step of processing any text data, particularly for tasks like sentiment analysis, where stop words have little semantic meaning, but can bloat your corpus. In this post, we'll go over how to remove and customize stop words using NLTK.

Einblick Content Team - May 4th, 2023

Aggregating data using one or more operations can be a really useful way to summarize large datasets. In this post, we'll cover how to use pandas' groupby() and agg() functions together so that you can easily summarize and aggregate your data.
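
As a rough illustration of pairing groupby() with agg(), here is a minimal sketch. The `sales` DataFrame and its column names are made up for this example, and the named-aggregation syntax is just one of several ways to call agg().

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "units": [10, 20, 5, 15],
    "price": [2.0, 4.0, 3.0, 5.0],
})

# Group rows by region, then apply a different aggregation to each column.
# Named aggregation: output_column=(input_column, aggregation).
summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    mean_price=("price", "mean"),
)

print(summary)
```

Named aggregation keeps the output column names explicit, which can help when several operations are applied at once.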

Einblick Content Team - May 2nd, 2023

Heatmaps are a useful visualization for comparing variables or exploring the relationship between them. In this post, we utilize seaborn's heatmap() function and provide examples using a few key arguments that can ensure your heatmap conveys information effectively.

Einblick Content Team - April 28th, 2023

Timeit is a module in the Python standard library that provides various functions for measuring the execution times of small portions of code. In this post, we’ll go over the timeit module, its relevant functions, and a few examples.
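
For a sense of what that looks like, here is a small sketch using timeit.timeit() and timeit.repeat(). The function being timed, `sum_of_squares`, is a stand-in invented for this example, and the reported times will vary by machine.

```python
import timeit

# Hypothetical snippet to benchmark: sum the squares of 0..999.
def sum_of_squares():
    return sum(i * i for i in range(1000))

# timeit.timeit() calls the function `number` times and returns the
# total elapsed time in seconds.
total_seconds = timeit.timeit(sum_of_squares, number=1_000)

# timeit.repeat() repeats that whole measurement `repeat` times; taking the
# minimum is a common way to reduce noise from other processes.
runs = timeit.repeat(sum_of_squares, number=1_000, repeat=3)
best_seconds = min(runs)

print(f"total: {total_seconds:.4f}s, best of 3: {best_seconds:.4f}s")
```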

Einblick Content Team - April 27th, 2023

In this post, we'll go over the process of one-hot encoding categorical variables using scikit-learn's OneHotEncoder() function. Before running any data science model, whether it's a linear regression, decision tree, random forest, or any other model, it's important to properly prepare your data.

Einblick Content Team - April 25th, 2023

In the example below, we’re examining shoe sale data collected from Adidas retailers, with a focus on the operating margin. We’ll go over the basic syntax for using the ttest_1samp() function from SciPy, and some further information about t-tests.

Einblick Content Team - April 20th, 2023

In this post, we’ll show you how to conduct a two-sample t-test in Python using the SciPy library. We’ll cover the basic syntax, and a few key arguments you can use to further configure your hypothesis test.

Einblick Content Team - April 19th, 2023

pandas dropna() is a function used to remove rows or columns with missing values (NaN) from a DataFrame. There are several interesting arguments you can leverage to tailor how missing data is handled. In this post, we’ll review the axis, how, thresh, and subset arguments.
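
To make the four arguments concrete, here is a short sketch on a toy DataFrame. The data is invented for illustration; the comments describe what each call keeps for this particular frame.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration. Row 1 is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [9.0, np.nan, 11.0, 12.0],
})

# Default (axis=0, how="any"): drop rows containing any NaN.
any_missing = df.dropna()

# how="all": drop only rows where every value is NaN.
all_missing = df.dropna(how="all")

# thresh=2: keep rows with at least 2 non-missing values.
at_least_two = df.dropna(thresh=2)

# subset: only look at column "a" when deciding which rows to drop.
subset_a = df.dropna(subset=["a"])

# axis=1 with thresh=3: keep columns with at least 3 non-missing values.
dense_columns = df.dropna(axis=1, thresh=3)

print(any_missing, all_missing, at_least_two, subset_a, dense_columns, sep="\n\n")
```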

Einblick Content Team - April 14th, 2023

One way to iterate through values stored in iterable objects like lists, sets, and tuples is by creating an iterator from the iterable, and then using the next() function. In this post, we’ll go through an example using Python next() and the basic syntax.
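
A minimal sketch of that pattern follows. The `colors` list is a made-up example; the second argument to next() is an optional default returned once the iterator is exhausted, instead of raising StopIteration.

```python
# Hypothetical iterable, invented for illustration.
colors = ["red", "green", "blue"]

# iter() creates an iterator from the iterable.
color_iter = iter(colors)

# Each call to next() returns the following item.
first = next(color_iter)
second = next(color_iter)
third = next(color_iter)

# The iterator is now exhausted. Passing a default avoids StopIteration.
fallback = next(color_iter, "no more colors")

print(first, second, third, fallback)
```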

Einblick Content Team - April 13th, 2023

In this post, we’ll go over how to create a confusion matrix in scikit-learn. The first function will create the values for the 4 quadrants in a confusion matrix, and the second function will create a nicely formatted plot.

Einblick Content Team - April 12th, 2023

By combining multiple subplots into one figure, it is much easier to compare and contrast the results, while keeping everything organized and digestible. In this post, we’ll cover the subplots() function from matplotlib, and how to combine it with seaborn visualizations.

Einblick Content Team - April 6th, 2023

One of the most powerful functions in pandas is the groupby() function, which is an efficient way of summarizing large datasets with just a few lines of code. In this post, we’ll combine groupby() with the count() function. We’ll cover basic syntax and a few examples, as well as compare count() and size().
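
The difference between count() and size() shows up when data is missing. Here is a small sketch on an invented `orders` DataFrame: count() tallies non-missing values per group, while size() tallies rows per group.

```python
import numpy as np
import pandas as pd

# Hypothetical order data with some missing amounts, invented for illustration.
orders = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "amount": [10.0, np.nan, 5.0, 7.0, np.nan],
})

# count() ignores NaN values in the selected column.
counts = orders.groupby("store")["amount"].count()

# size() counts every row in the group, NaN or not.
sizes = orders.groupby("store").size()

print(counts)
print(sizes)
```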

Einblick Content Team - March 17th, 2023

Box plots or box-and-whisker plots are particularly useful in comparing distributions of continuous variables across groups, and identifying outliers. In this post, we’ll use seaborn’s boxplot() function to create and customize different box plots.

Einblick Content Team - March 16th, 2023

Histograms are a key visualization tool that help show the distribution of numerical data. Some histograms are easier than others to customize. This post will go over some of the many ways you can use seaborn’s histplot() function to create highly tuned and beautiful histograms.

Einblick Content Team - March 14th, 2023

In this post, we'll use matplotlib and seaborn together to create customized, beautiful axis labels, axis tick labels, and titles for your plots so that your data can speak for itself.

Einblick Content Team - March 8th, 2023

In this post, we’ll provide a comprehensive guide on seaborn’s scatterplot() function. We’ll cover a few key arguments, including hue, style, palette, and size that will help you create more compelling graphs.

Becca Weng - March 6th, 2023

This post will go over how to effectively visualize data using seaborn’s built-in lineplot() function. There are many parameters you can use to craft a more comprehensive data story, such as hue, style, markers, errorbar, err_style, and legend.

Einblick Content Team - March 1st, 2023

In data analysis, finding the global minimum of a function is a common task. However, it can be challenging to find the optimal solution due to the presence of multiple local minima. In this tutorial, we provide an example of using the scipy.optimize.basinhopping() function to find the global minimum of a one-dimensional multimodal function.

Einblick Content Team - February 28th, 2023

Learn how to perform constrained optimization using the scipy.optimize.minimize function. Get the best solution to your optimization problem while taking into consideration specific constraints on the solution.

Einblick Content Team - February 27th, 2023

In this tutorial, we'll explore how to minimize a function using the scipy.optimize.minimize function. By using this function, you can find the minimum value of a function, which is useful for optimization problems. We'll guide you through the steps of defining an objective function and key function arguments.

Einblick Content Team - February 24th, 2023

If your simple linear regression model exhibits heteroscedasticity, you can adjust the model to account for it in several ways. One way is to use weighted least squares (WLS) regression, which allows you to specify a weight for each data point. Check out this example using randomly generated data and the statsmodels library.

Einblick Content Team - February 13th, 2023

Decorators are a powerful and flexible feature of Python that allow you to modify the behavior of a function or method without modifying the base function’s underlying code or repeating the same code over and over again. In this post, we’ll go over basic syntax and an example that evaluates code performance.
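
As one possible shape for such a decorator, here is a sketch that records how long a call took. The names `timed`, `last_elapsed`, and `slow_sum` are invented for this example, and storing the timing on the wrapper is just one design choice among several.

```python
import functools
import time

def timed(func):
    """Hypothetical decorator: record the duration of the most recent call."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        # Store the elapsed time (in seconds) on the wrapper for inspection.
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper

@timed
def slow_sum(n):
    """Sum the integers from 0 to n - 1."""
    return sum(range(n))

total = slow_sum(100_000)
print(f"{slow_sum.__name__} returned {total} in {slow_sum.last_elapsed:.6f}s")
```

The body of `slow_sum` is untouched; the timing logic lives entirely in `timed` and can be reused on any other function.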

Becca Weng - February 10th, 2023

There are many ways to check that there is constant variance of errors across values of the X variables in a regression model. This post will go over a visual way to check for homoscedasticity or to diagnose heteroskedasticity, using residual plots after you’ve built your linear regression model.

Einblick Content Team - February 9th, 2023

Ordinary least squares (OLS) is one of the classic regression techniques for a reason: the results are highly interpretable. However, we have to ensure key model assumptions are met. This post will cover how to run the Breusch-Pagan test for heteroskedasticity using the statsmodels package.

Einblick Content Team - January 20th, 2023

Avoid unstable and unreliable model coefficients with this comprehensive guide to checking for multicollinearity in Python using seaborn and statsmodels. Learn about multicollinearity and how to use the variance inflation factor (VIF) and correlation coefficients.

Einblick Content Team - January 17th, 2023

Testing for heteroskedasticity (with a "k" or "c") is essential when running various regression models. For example, one of the main assumptions of OLS is that there is constant variance (homoscedasticity) among the residuals or errors of your linear regression model. Learn how to run and interpret White's test for heteroskedasticity using statsmodels.

Einblick Content Team - January 11th, 2023

In this post, we’ll review seaborn’s catplot() function, which is helpful for creating different kinds of plots to help you analyze and understand the relationships between continuous and categorical variables. We’ll go over how to use catplot() and some tips for customizing the appearance and layout of your plots.

Einblick Content Team - January 5th, 2023

In this post, we’ll be going over two ways to perform linear regression using ordinary least squares (OLS) estimation with the statsmodels library. Get a detailed summary of your model fit and access useful summary statistics with these simple functions.

Einblick Content Team - December 22nd, 2022

This code demonstrates how to use the ProcessPoolExecutor and ThreadPoolExecutor classes from the concurrent.futures module to run multiple threads and processes concurrently or in parallel to save you time.
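
A minimal sketch of both executors is below. The `square` function and the helper names are invented for this example. As a rule of thumb, threads tend to suit I/O-bound work and processes tend to suit CPU-bound work, though the right choice depends on the workload.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(n):
    """Hypothetical unit of work, invented for illustration."""
    return n * n

def run_in_threads(values):
    # Threads share memory within one process.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(square, values))

def run_in_processes(values):
    # Each worker is a separate process, so work can run on multiple cores.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(square, values))

# The guard matters for ProcessPoolExecutor on platforms that start
# worker processes by re-importing this module.
if __name__ == "__main__":
    numbers = [1, 2, 3, 4]
    print(run_in_threads(numbers))
    print(run_in_processes(numbers))
```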

Einblick Content Team - December 16th, 2022

NumPy arrays are stored in contiguous blocks of memory, which allows NumPy to take advantage of vectorization and other optimization techniques. Python lists instead hold references to individually allocated objects, which makes them slower and less memory-efficient than NumPy arrays for numerical data.
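
To illustrate the vectorization point, here is a brief sketch that doubles the same values with a list comprehension and with a single NumPy expression. The variable names are invented for this example, and no timing claims are made since speedups depend on the data and the machine.

```python
import numpy as np

values_list = list(range(1_000))
values_array = np.arange(1_000)

# With a list, Python loops over the elements one object at a time.
doubled_list = [v * 2 for v in values_list]

# With an array, one expression operates on the whole block of memory.
doubled_array = values_array * 2

# The array reports that its data sits in one contiguous block.
is_contiguous = values_array.flags["C_CONTIGUOUS"]

print(doubled_list[:5], doubled_array[:5], is_contiguous)
```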

Einblick Content Team - December 15th, 2022

One useful but not well-understood Python tip for data science is the use of generator expressions. Generator expressions are similar to list comprehensions, but they are more memory efficient because they do not create a new list object in memory.
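
The memory difference can be sketched as follows. The variable names are invented for this example; note that sys.getsizeof() reports only the size of the container object itself, which is enough to show the contrast here.

```python
import sys

# A list comprehension builds every element up front.
squares_list = [n * n for n in range(100_000)]

# A generator expression produces elements lazily, one at a time.
squares_gen = (n * n for n in range(100_000))

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)

# Consuming the generator yields the same values as the list.
total = sum(squares_gen)

print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes, sum: {total}")
```

One trade-off to keep in mind: a generator can only be consumed once, so after the sum() call above `squares_gen` is exhausted.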

Einblick Content Team - December 14th, 2022

Caching is a technique for storing the results of expensive computations so that they can be quickly retrieved later. In Python, you can use functools.lru_cache(), which stands for least recently used (LRU) cache, to easily add caching to a function.
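
Here is a small sketch of lru_cache applied to a recursive Fibonacci function. The `fib` function and the `maxsize` value are chosen only for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def fib(n):
    """Naive recursive Fibonacci; the cache avoids recomputing subproblems."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

value = fib(30)

# cache_info() reports hits, misses, maxsize, and current cache size.
info = fib.cache_info()

print(value, info)
```

Without the decorator, the same call would recompute the smaller Fibonacci numbers many times over; with it, each distinct argument is computed once.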

Benedetto Buratti - November 18th, 2022

Fast Einblick Tools to make data manipulation faster. This first Tool series explores a sequence of Concat, Sort, and Join operations to manipulate and enrich customer data.