pandas basics: cast data types using pandas astype() function

Einblick Content Team - May 5th, 2023

The pandas astype() method is a convenient way to cast an entire DataFrame, specific columns, or a single Series to a different dtype. In this post, we'll go over the basic syntax and a few examples. The data we use comes from a Kaggle dataset on Goodreads.

Basic Syntax: df.astype(dtype)

The only argument you need is dtype, set to the data type you would like ALL of the data in your DataFrame to have. Let's check what the DataFrame looks like first. Here, df holds only the six numeric columns of the dataset.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5495 entries, 0 to 5494
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              5495 non-null   int32  
 1   average_rating      5495 non-null   float32
 2   isbn13              5495 non-null   int64  
 3   num_pages           5495 non-null   int32  
 4   ratings_count       5495 non-null   int32  
 5   text_reviews_count  5495 non-null   int32  
dtypes: float32(1), int32(4), int64(1)
memory usage: 150.4 KB

We can see that there are three dtypes in the DataFrame initially: float32, int32, and int64. Now let's cast all of the columns to floats. We can pass the dtype either as a string, such as "float", or as the Python type itself, float.

Example 1: df.astype("float")

df2 = df.astype("float")
df2.dtypes

Output:

bookID                float64
average_rating        float64
isbn13                float64
num_pages             float64
ratings_count         float64
text_reviews_count    float64
dtype: object

Example 2: df.astype(float)

df3 = df.astype(float)
df3.dtypes

Output:

bookID                float64
average_rating        float64
isbn13                float64
num_pages             float64
ratings_count         float64
text_reviews_count    float64
dtype: object

In both cases, every column was successfully cast to float64, the dtype that both "float" and float resolve to.

NOTE: astype() will raise an error if a value cannot be cast to the requested dtype.
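To try the two spellings without downloading the dataset, here is a minimal sketch on a small stand-in DataFrame. The name books and all of its values are made up for illustration; only the column names and dtypes mirror the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Goodreads data; the values are invented.
books = pd.DataFrame(
    {
        "bookID": np.array([1, 2, 3], dtype="int32"),
        "average_rating": np.array([4.5, 3.9, 4.1], dtype="float32"),
        "num_pages": np.array([652, 870, 352], dtype="int32"),
    }
)

# The string "float" and the Python type float request the same dtype.
as_string_name = books.astype("float")
as_python_type = books.astype(float)

print(as_string_name.dtypes)
print(as_python_type.dtypes)

# astype() returns a new DataFrame, so the original keeps its dtypes.
print(books.dtypes)
```

Because astype() returns a new object rather than modifying df in place, the result has to be assigned to a variable, as with df2 and df3 above.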

Example 3: ValueError

If we take all of the columns in the initial dataset, including string columns such as the title and authors of books, and try to cast the entire DataFrame to float, pandas raises a ValueError. First, here is the full DataFrame.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5495 entries, 0 to 5494
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bookID              5495 non-null   int32  
 1   title               5495 non-null   string 
 2   authors             5495 non-null   string 
 3   average_rating      5495 non-null   float32
 4   isbn                5495 non-null   string 
 5   isbn13              5495 non-null   int64  
 6   language_code       5495 non-null   string 
 7   num_pages           5495 non-null   int32  
 8   ratings_count       5495 non-null   int32  
 9   text_reviews_count  5495 non-null   int32  
 10  publication_date    5495 non-null   string 
 11  publisher           5495 non-null   string 
dtypes: float32(1), int32(4), int64(1), string(6)
memory usage: 408.0 KB

Notice that there are several columns containing string data.

df.astype("float")

Output:

ValueError: could not convert string to float: 'Angle of Repose'
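One way to handle this, sketched below under stated assumptions, is to catch the ValueError and then cast only the numeric columns. The small books DataFrame and its values are made up for illustration, and using select_dtypes() to pick the numeric columns is one option among several, not the approach taken in this post.

```python
import pandas as pd

# Hypothetical two-row frame with one string column; values are invented.
books = pd.DataFrame(
    {
        "title": ["Angle of Repose", "Dune"],
        "num_pages": [569, 412],
    }
)

# Casting everything to float fails because titles are not numbers.
cast_failed = False
try:
    books.astype("float")
except ValueError as err:
    cast_failed = True
    print(f"Cast failed: {err}")

# Cast only the numeric columns and leave the string column untouched.
numeric_cols = books.select_dtypes(include="number").columns
books_cast = books.astype({col: "float" for col in numeric_cols})
print(books_cast.dtypes)
```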

Casting specific columns

Example 1: df.astype({"col1": "dtype", "col2": "dtype"})

In this example, you call astype() on the DataFrame, df, as before, but instead of passing one dtype to apply to all of the columns, you pass a dictionary in which each key is a column name and each value is the dtype to cast that column to. Columns that are not in the dictionary keep their current dtype.

df4 = df.astype({"ratings_count": "int64", "text_reviews_count": "int64"})
df4.dtypes

Output:

bookID                  int32
title                  string
authors                string
average_rating        float32
isbn                   string
isbn13                  int64
language_code          string
num_pages               int32
ratings_count           int64
text_reviews_count      int64
publication_date       string
publisher              string
dtype: object

As you can see, the two columns ratings_count and text_reviews_count were successfully cast to int64.
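The dictionary form can also mix target dtypes in a single call. The sketch below uses a made-up books DataFrame (names and values are illustrative only) and also shows that a misspelled column name in the dictionary raises a KeyError instead of being silently ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; titles and numbers are invented.
books = pd.DataFrame(
    {
        "title": ["Book A", "Book B"],
        "average_rating": np.array([4.25, 3.5], dtype="float32"),
        "ratings_count": np.array([120, 45], dtype="int32"),
    }
)

# Each column gets its own target dtype; "title" is left alone.
widened = books.astype({"ratings_count": "int64", "average_rating": "float64"})
print(widened.dtypes)

# A key that is not a column name raises KeyError.
missing_key_raised = False
try:
    books.astype({"rating_count": "int64"})  # typo: no such column
except KeyError:
    missing_key_raised = True
print(missing_key_raised)
```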

Example 2: df["col"].astype(dtype)

Another option is to call astype() directly on the column, which is a Series. Because astype() returns a new Series, we assign the result back to the column. In the example below, we cast bookID to float.

df["bookID"] = df["bookID"].astype(float)
df.dtypes

Output:

bookID                float64
title                  string
authors                string
average_rating        float32
isbn                   string
isbn13                  int64
language_code          string
num_pages               int32
ratings_count           int32
text_reviews_count      int32
publication_date       string
publisher              string
dtype: object

NOTE: astype() works for many data types, but for certain targets, such as datetime, your data needs to be in a specific format before you can call astype(). Otherwise, you'll have to use a more specialized conversion function like pd.to_datetime().
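As a rough illustration of the datetime case, the sketch below casts ISO-formatted strings with astype() and parses month/day/year strings with pd.to_datetime() and an explicit format. The dates and the "%m/%d/%Y" format are assumptions chosen for the example, not values taken from the dataset.

```python
import pandas as pd

# ISO 8601 strings (YYYY-MM-DD) can be cast directly with astype().
iso_dates = pd.Series(["2006-09-16", "2004-11-01"])
iso_cast = iso_dates.astype("datetime64[ns]")
print(iso_cast.dtype)

# For other layouts, pd.to_datetime() with an explicit format
# removes any ambiguity about which number is the month.
us_dates = pd.Series(["9/16/2006", "11/1/2004"])
parsed = pd.to_datetime(us_dates, format="%m/%d/%Y")
print(parsed.dtype)
```

Passing format explicitly is a design choice: it makes the parse fail loudly on rows that do not match, instead of guessing.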
