{ "cells": [ { "cell_type": "markdown", "id": "ccc704a0", "metadata": { "id": "ccc704a0" }, "source": [ "
\n", "

Python Project Sprint - Part 1

\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "markdown", "id": "tmN_dweRVy5b", "metadata": { "id": "tmN_dweRVy5b" }, "source": [ "# Your Experience Matters\n", "We require every learner to participate in a\n", "Course Experience Survey twice during their course so we can ensure you are meeting your learning goals and having the best possible experience.\n", "\n", "Please take 10 minutes to rate how you've enjoyed the Course so far by going to **Synapse → Experience Survey**." ] }, { "cell_type": "code", "execution_count": null, "id": "5d30b8a1", "metadata": { "id": "5d30b8a1" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "markdown", "id": "baec20d3", "metadata": { "id": "baec20d3" }, "source": [ "# E-commerce Dataset Overview\n", "\n", "In this Python Project Sprint, we'll be working with a dataset derived from an online e-commerce platform. The dataset captures customer purchase records, reflecting the shopping behavior and interactions of users. It includes important details such as purchase dates, product categories, amounts spent, customer demographics, and payment methods.\n", "\n", "This dataset is a valuable tool for:\n", "\n", "1. Identifying popular product categories and purchase trends.\n", "2. Analyzing customer spending patterns and behaviors.\n", "3. Examining transaction frequency and payment preferences.\n", "4. Building recommender systems and planning marketing strategies.\n", "\n", "You can find the data dictionary for this dataset below.\n", "\n", "---\n", "\n", "## Data Dictionary\n", "\n", "| Column Name | Data Type | Description |\n", "|---------------------------|-----------|-------------------------------------------------------------------------------------------------|\n", "| `Customer ID` | int | Unique identifier for each customer. |\n", "| `Age` | string | Customer's age in years. |\n", "| `Gender` | string | Gender of the customer. |\n", "| `Item Purchased` | string | Name of the item purchased by the customer. |\n", "| `Category` | string | Product category the item belongs to. |\n", "| `Purchase Amount (USD)` | float | The amount of money spent on the purchase, in USD. |\n", "| `Location` | string | The customer's location (state or region). |\n", "| `Size` | string | Size of the purchased item (e.g., S, M, L). |\n", "| `Color` | string | Color of the purchased item. |\n", "| `Discount Applied` | string | Indicates if a discount was applied (\"Yes\" or \"No\"). |\n", "| `Promo Code Used` | string | Indicates if a promotional code was used (\"Yes\" or \"No\"). |\n", "| `Previous Purchases` | float | Number of previous purchases made by the customer. |\n", "| `Payment Method` | string | Payment method used by the customer. |\n", "| `Frequency of Purchases` | string | Frequency of purchases by the customer (e.g., Weekly, Monthly). |\n", "| `user id` | float | ID of the user associated with the purchase. |\n", "| `product id` | string | Unique identifier for the purchased product. |\n", "| `Interaction type` | string | Type of interaction the customer had with the product (e.g., purchase, view, like). |\n", "| `Time stamp` | string | Date and time of the transaction or interaction (in DD/MM/YYYY HH:MM format). |" ] }, { "cell_type": "markdown", "id": "831a627a", "metadata": { "id": "831a627a" }, "source": [ "# Data Cleaning" ] }, { "cell_type": "markdown", "id": "e7dfda8a", "metadata": { "id": "e7dfda8a" }, "source": [ "To begin our analysis of the dataset, we will first inspect and clean it. The data cleaning process will involve adjusting data types, removing null values, removing duplicates, normalizing text columns, and dropping redundant columns.\n", "\n", "Firstly, we will load the data to do an initial inspection and determine the number of rows and columns it contains." ] }, { "cell_type": "code", "execution_count": null, "id": "43aa7855", "metadata": { "id": "43aa7855", "outputId": "e9d5067d-522f-40fa-a39c-05ae45194ff5" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of Purchasesuser idproduct idInteraction typeTime stampUnnamed: 4
00155,0MalebLOuSEClothing53KeNtUCkYLGrayWinter3.1YesExpressYesYes14Venmo\\nFortnightly1.04c69b61db1fc16e7013b43fc926e502dpurchase10/10/2023 8:00NaN
11219,0MaleSWEaTerClothing64MaINeLMaroonWinter3.1YesExpressYesYes2Cash\\nFortnightly2.066d49bbed043f5be260fa9f7fbff5957view11/10/2023 8:00NaN
22350,0MalejeaNSClothing73MASSAcHUsetTsSMaroonSpring3.1YesFree ShippingYesYes23Credit Card\\nWeekly3.02c55cae269aebf53838484b0d7dd931alike12/10/2023 8:00NaN
33421,0MalesaNdalsFootwear90RHodE IsLAnDMMaroonSpring3.5YesNext Day AirYesYes49PayPal\\nWeekly4.018018b6bc416dab347b1b7db79994afaview13/10/2023 8:00NaN
44545,0MaleBLoUSEClothing49OReGONMTurquoiseSpring2.7YesFree ShippingYesYes31PayPal\\nAnnually5.0e04b990e95bf73bbe6a3fa09785d7cd0like14/10/2023 8:00NaN
\n", "
" ], "text/plain": [ " Unnamed: 0 Customer ID Age Gender Item Purchased Category \\\n", "0 0 1 55,0 Male bLOuSE Clothing \n", "1 1 2 19,0 Male SWEaTer Clothing \n", "2 2 3 50,0 Male jeaNS Clothing \n", "3 3 4 21,0 Male saNdals Footwear \n", "4 4 5 45,0 Male BLoUSE Clothing \n", "\n", " Purchase Amount (USD) Location Size Color Season \\\n", "0 53 KeNtUCkY L Gray Winter \n", "1 64 MaINe L Maroon Winter \n", "2 73 MASSAcHUsetTs S Maroon Spring \n", "3 90 RHodE IsLAnD M Maroon Spring \n", "4 49 OReGON M Turquoise Spring \n", "\n", " Review Rating Subscription Status Shipping Type Discount Applied \\\n", "0 3.1 Yes Express Yes \n", "1 3.1 Yes Express Yes \n", "2 3.1 Yes Free Shipping Yes \n", "3 3.5 Yes Next Day Air Yes \n", "4 2.7 Yes Free Shipping Yes \n", "\n", " Promo Code Used Previous Purchases Payment Method Frequency of Purchases \\\n", "0 Yes 14 Venmo\\n Fortnightly \n", "1 Yes 2 Cash\\n Fortnightly \n", "2 Yes 23 Credit Card\\n Weekly \n", "3 Yes 49 PayPal\\n Weekly \n", "4 Yes 31 PayPal\\n Annually \n", "\n", " user id product id Interaction type \\\n", "0 1.0 4c69b61db1fc16e7013b43fc926e502d purchase \n", "1 2.0 66d49bbed043f5be260fa9f7fbff5957 view \n", "2 3.0 2c55cae269aebf53838484b0d7dd931a like \n", "3 4.0 18018b6bc416dab347b1b7db79994afa view \n", "4 5.0 e04b990e95bf73bbe6a3fa09785d7cd0 like \n", "\n", " Time stamp Unnamed: 4 \n", "0 10/10/2023 8:00 NaN \n", "1 11/10/2023 8:00 NaN \n", "2 12/10/2023 8:00 NaN \n", "3 13/10/2023 8:00 NaN \n", "4 14/10/2023 8:00 NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the dataset\n", "df = pd.read_csv('data/e-commerce.csv')\n", "\n", "# Evaluate the dataset\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "07f1a306", "metadata": { "id": "07f1a306", "outputId": "b69ef2c3-223c-434b-ff62-59716c50c2ac" }, "outputs": [ { "data": { "text/plain": [ "(3009, 24)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Shape of dataset\n", "df.shape" ] }, { "cell_type": "markdown", "id": "9ede0b41", "metadata": { "id": "9ede0b41" }, "source": [ "The dataset has 3009 rows and 24 columns. By only examining the top rows, we can see redundant unnamed columns, some categorical columns containing a mixture of lowercase and uppercase letters that require normalization, and extra `,` in the **Age** column, and an extra newline character (`\\n`) in the **Payment Method** column.\n", "\n", "Next, let's analyze the data types of the columns." ] }, { "cell_type": "code", "execution_count": null, "id": "0ebad47a", "metadata": { "id": "0ebad47a", "outputId": "2c542136-e545-4b6c-fa1f-0a9abed8f413" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3009 entries, 0 to 3008\n", "Data columns (total 24 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Unnamed: 0 3009 non-null int64 \n", " 1 Customer ID 3009 non-null int64 \n", " 2 Age 3009 non-null object \n", " 3 Gender 3009 non-null object \n", " 4 Item Purchased 3009 non-null object \n", " 5 Category 3009 non-null object \n", " 6 Purchase Amount (USD) 3009 non-null int64 \n", " 7 Location 3009 non-null object \n", " 8 Size 3009 non-null object \n", " 9 Color 3009 non-null object \n", " 10 Season 3009 non-null object \n", " 11 Review Rating 3009 non-null float64\n", " 12 Subscription Status 3009 non-null object \n", " 13 Shipping Type 3009 non-null object \n", " 14 Discount Applied 3009 non-null object \n", " 15 Promo Code Used 3009 non-null object \n", " 16 Previous Purchases 3009 non-null int64 \n", " 17 Payment Method 3009 non-null object \n", " 18 Frequency of Purchases 3009 non-null object \n", " 19 user id 3009 non-null float64\n", " 20 product id 3009 non-null object \n", " 21 Interaction type 2881 non-null object \n", " 22 Time stamp 3009 non-null object \n", " 23 Unnamed: 4 0 non-null float64\n", "dtypes: float64(3), int64(4), object(17)\n", "memory usage: 564.3+ KB\n" ] } ], "source": [ "# Let's now look at data types\n", "df.info()" ] }, { "cell_type": "markdown", "id": "9e2d4cf7", "metadata": { "id": "9e2d4cf7" }, "source": [ "We can see that the **Interaction type** column contains 2881 non-null values, indicating the presence of missing entries in this column. In terms of data type, we can see that all of the columns have the correct data type except for **Age**, which should be an integer, and **Time stamp**, which has to be datetime. We need to address these issues." ] }, { "cell_type": "markdown", "id": "d4d239a4", "metadata": { "id": "d4d239a4" }, "source": [ "Let's also explore some of the categorical columns and look into their unique values." ] }, { "cell_type": "code", "execution_count": null, "id": "f71f0423", "metadata": { "id": "f71f0423", "outputId": "17d728eb-daea-4c3d-c041-3d7fd91cc761", "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Gender\n", "Male 2662\n", "Female 347\n", "Name: count, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's get the unique values in column Gender\n", "df['Gender'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "97fb1029", "metadata": { "id": "97fb1029", "outputId": "dbbf3c2c-7bdc-42c8-a802-fde689aa8c2c", "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Size\n", "M 1331\n", "L 817\n", "S 529\n", "XL 332\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's get the unique values in column Size\n", "df['Size'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "bec39f8f", "metadata": { "id": "bec39f8f", "outputId": "9b1a9d64-8c99-4331-a65c-b65107e2cd03", "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Color\n", "Silver 139\n", "Teal 138\n", "Yellow 136\n", "Olive 134\n", "Black 133\n", "Cyan 133\n", "Violet 131\n", "Green 124\n", "Beige 121\n", "Purple 119\n", "Gray 118\n", "Indigo 118\n", "Orange 118\n", "Blue 117\n", "Maroon 117\n", "Charcoal 117\n", "Red 114\n", "Turquoise 113\n", "Peach 112\n", "Pink 111\n", "White 111\n", "Magenta 111\n", "Brown 109\n", "Lavender 109\n", "Gold 106\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's get the unique values in column Color\n", "df['Color'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "b9a74a2f", "metadata": { "id": "b9a74a2f", "outputId": "b6ab395f-6708-44e6-b86b-a3df6c4fa36c", "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Location\n", "OHIO 10\n", "IowA 9\n", "oHIO 7\n", "UtAH 7\n", "iowa 7\n", " ..\n", "MaSsaChUsetTS 1\n", "DelaWARe 1\n", "GEOrGiA 1\n", "oKLAhOMA 1\n", "WisCONSin 1\n", "Name: count, Length: 2465, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's get the unique values in column Location\n", "df['Location'].value_counts()" ] }, { "cell_type": "markdown", "id": "0a4010cf", "metadata": { "id": "0a4010cf" }, "source": [ "## Remove Redundant Columns and Rename Columns" ] }, { "cell_type": "markdown", "id": "3bde53da", "metadata": { "id": "3bde53da" }, "source": [ "Let's first remove the redundant **Unnamed: 0** and **Unnamed: 4** columns, as we don't need them in our analysis. We can also drop the **Product id** column, as it shows the unique ID of the purchased item, and we won't use that either since we have the item name.\n", "\n", "Another column that we can remove is **user id**, since it's a duplicate of the **Customer ID** column." ] }, { "cell_type": "code", "execution_count": null, "id": "189f5fa1", "metadata": { "id": "189f5fa1" }, "outputs": [], "source": [ "# drop the columnsthat we don't need\n", "df.drop(['Unnamed: 0','product id', 'user id', 'Unnamed: 4'], axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "46040b02", "metadata": { "id": "46040b02", "outputId": "306f45a0-1fc9-4447-902e-6de82c5040f4" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typeTime stamp
0155,0MalebLOuSEClothing53KeNtUCkYLGrayWinter3.1YesExpressYesYes14Venmo\\nFortnightlypurchase10/10/2023 8:00
1219,0MaleSWEaTerClothing64MaINeLMaroonWinter3.1YesExpressYesYes2Cash\\nFortnightlyview11/10/2023 8:00
2350,0MalejeaNSClothing73MASSAcHUsetTsSMaroonSpring3.1YesFree ShippingYesYes23Credit Card\\nWeeklylike12/10/2023 8:00
3421,0MalesaNdalsFootwear90RHodE IsLAnDMMaroonSpring3.5YesNext Day AirYesYes49PayPal\\nWeeklyview13/10/2023 8:00
4545,0MaleBLoUSEClothing49OReGONMTurquoiseSpring2.7YesFree ShippingYesYes31PayPal\\nAnnuallylike14/10/2023 8:00
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "0 1 55,0 Male bLOuSE Clothing 53 \n", "1 2 19,0 Male SWEaTer Clothing 64 \n", "2 3 50,0 Male jeaNS Clothing 73 \n", "3 4 21,0 Male saNdals Footwear 90 \n", "4 5 45,0 Male BLoUSE Clothing 49 \n", "\n", " Location Size Color Season Review Rating Subscription Status \\\n", "0 KeNtUCkY L Gray Winter 3.1 Yes \n", "1 MaINe L Maroon Winter 3.1 Yes \n", "2 MASSAcHUsetTs S Maroon Spring 3.1 Yes \n", "3 RHodE IsLAnD M Maroon Spring 3.5 Yes \n", "4 OReGON M Turquoise Spring 2.7 Yes \n", "\n", " Shipping Type Discount Applied Promo Code Used Previous Purchases \\\n", "0 Express Yes Yes 14 \n", "1 Express Yes Yes 2 \n", "2 Free Shipping Yes Yes 23 \n", "3 Next Day Air Yes Yes 49 \n", "4 Free Shipping Yes Yes 31 \n", "\n", " Payment Method Frequency of Purchases Interaction type Time stamp \n", "0 Venmo\\n Fortnightly purchase 10/10/2023 8:00 \n", "1 Cash\\n Fortnightly view 11/10/2023 8:00 \n", "2 Credit Card\\n Weekly like 12/10/2023 8:00 \n", "3 PayPal\\n Weekly view 13/10/2023 8:00 \n", "4 PayPal\\n Annually like 14/10/2023 8:00 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at the dataset again to ensure index has been dropped\n", "df.head()" ] }, { "cell_type": "markdown", "id": "8feb8e6c", "metadata": { "id": "8feb8e6c" }, "source": [ "We can also rename the column **Time stamp** to **timestamp**, as this is more common in the industry. To do so, we can use the `.rename()` method, which takes in a dictionary with the old name as the key and the new name as the value." ] }, { "cell_type": "code", "execution_count": null, "id": "176303e8", "metadata": { "id": "176303e8", "outputId": "0d3d6697-250f-491b-ab80-a29e7f6fd3be" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestamp
0155,0MalebLOuSEClothing53KeNtUCkYLGrayWinter3.1YesExpressYesYes14Venmo\\nFortnightlypurchase10/10/2023 8:00
1219,0MaleSWEaTerClothing64MaINeLMaroonWinter3.1YesExpressYesYes2Cash\\nFortnightlyview11/10/2023 8:00
2350,0MalejeaNSClothing73MASSAcHUsetTsSMaroonSpring3.1YesFree ShippingYesYes23Credit Card\\nWeeklylike12/10/2023 8:00
3421,0MalesaNdalsFootwear90RHodE IsLAnDMMaroonSpring3.5YesNext Day AirYesYes49PayPal\\nWeeklyview13/10/2023 8:00
4545,0MaleBLoUSEClothing49OReGONMTurquoiseSpring2.7YesFree ShippingYesYes31PayPal\\nAnnuallylike14/10/2023 8:00
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "0 1 55,0 Male bLOuSE Clothing 53 \n", "1 2 19,0 Male SWEaTer Clothing 64 \n", "2 3 50,0 Male jeaNS Clothing 73 \n", "3 4 21,0 Male saNdals Footwear 90 \n", "4 5 45,0 Male BLoUSE Clothing 49 \n", "\n", " Location Size Color Season Review Rating Subscription Status \\\n", "0 KeNtUCkY L Gray Winter 3.1 Yes \n", "1 MaINe L Maroon Winter 3.1 Yes \n", "2 MASSAcHUsetTs S Maroon Spring 3.1 Yes \n", "3 RHodE IsLAnD M Maroon Spring 3.5 Yes \n", "4 OReGON M Turquoise Spring 2.7 Yes \n", "\n", " Shipping Type Discount Applied Promo Code Used Previous Purchases \\\n", "0 Express Yes Yes 14 \n", "1 Express Yes Yes 2 \n", "2 Free Shipping Yes Yes 23 \n", "3 Next Day Air Yes Yes 49 \n", "4 Free Shipping Yes Yes 31 \n", "\n", " Payment Method Frequency of Purchases Interaction type timestamp \n", "0 Venmo\\n Fortnightly purchase 10/10/2023 8:00 \n", "1 Cash\\n Fortnightly view 11/10/2023 8:00 \n", "2 Credit Card\\n Weekly like 12/10/2023 8:00 \n", "3 PayPal\\n Weekly view 13/10/2023 8:00 \n", "4 PayPal\\n Annually like 14/10/2023 8:00 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's rename column Time stamp to timestamp\n", "df.rename(columns={'Time stamp': 'timestamp'}, inplace=True)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "fef6e506", "metadata": { "id": "fef6e506" }, "source": [ "## Data Types" ] }, { "cell_type": "markdown", "id": "ecccfe6f", "metadata": { "id": "ecccfe6f" }, "source": [ "We can now correct the data type for each column. The column **timestamp** should have a `datetime` type, and **Age** should be an `int`. Let's first look at the data types once again." ] }, { "cell_type": "code", "execution_count": null, "id": "ab2669ec", "metadata": { "id": "ab2669ec", "outputId": "23227316-25ed-4b1e-e475-4dff3e099bf1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3009 entries, 0 to 3008\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Customer ID 3009 non-null int64 \n", " 1 Age 3009 non-null object \n", " 2 Gender 3009 non-null object \n", " 3 Item Purchased 3009 non-null object \n", " 4 Category 3009 non-null object \n", " 5 Purchase Amount (USD) 3009 non-null int64 \n", " 6 Location 3009 non-null object \n", " 7 Size 3009 non-null object \n", " 8 Color 3009 non-null object \n", " 9 Season 3009 non-null object \n", " 10 Review Rating 3009 non-null float64\n", " 11 Subscription Status 3009 non-null object \n", " 12 Shipping Type 3009 non-null object \n", " 13 Discount Applied 3009 non-null object \n", " 14 Promo Code Used 3009 non-null object \n", " 15 Previous Purchases 3009 non-null int64 \n", " 16 Payment Method 3009 non-null object \n", " 17 Frequency of Purchases 3009 non-null object \n", " 18 Interaction type 2881 non-null object \n", " 19 timestamp 3009 non-null object \n", "dtypes: float64(1), int64(3), object(16)\n", "memory usage: 470.3+ KB\n" ] } ], "source": [ "# Let's look at the data types again\n", "df.info()" ] }, { "cell_type": "markdown", "id": "ee9b5ef8", "metadata": { "id": "ee9b5ef8" }, "source": [ "As we can see, we have to adjust the data type for the `Age` and `timestamp` columns. Let's first adjust the type for the `timestamp` column. We can use the `pd.to_datetime` function, which takes the column and the raw format for the datetime values. In this scenario, we can see that **timestamp** has the format of `dd/mm/yyyy hh:mm`. We should pass this to our function so it gets parsed properly." ] }, { "cell_type": "code", "execution_count": null, "id": "3a66f59d", "metadata": { "id": "3a66f59d" }, "outputs": [], "source": [ "# Correcting the data type for timestamp column\n", "df['timestamp'] = pd.to_datetime(df['timestamp'], format=\"%d/%m/%Y %H:%M\")" ] }, { "cell_type": "code", "execution_count": null, "id": "70db9de8", "metadata": { "id": "70db9de8", "outputId": "b4d98a7a-eedf-47eb-c65c-9379811671e9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3009 entries, 0 to 3008\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Customer ID 3009 non-null int64 \n", " 1 Age 3009 non-null object \n", " 2 Gender 3009 non-null object \n", " 3 Item Purchased 3009 non-null object \n", " 4 Category 3009 non-null object \n", " 5 Purchase Amount (USD) 3009 non-null int64 \n", " 6 Location 3009 non-null object \n", " 7 Size 3009 non-null object \n", " 8 Color 3009 non-null object \n", " 9 Season 3009 non-null object \n", " 10 Review Rating 3009 non-null float64 \n", " 11 Subscription Status 3009 non-null object \n", " 12 Shipping Type 3009 non-null object \n", " 13 Discount Applied 3009 non-null object \n", " 14 Promo Code Used 3009 non-null object \n", " 15 Previous Purchases 3009 non-null int64 \n", " 16 Payment Method 3009 non-null object \n", " 17 Frequency of Purchases 3009 non-null object \n", " 18 Interaction type 2881 non-null object \n", " 19 timestamp 3009 non-null datetime64[ns]\n", "dtypes: datetime64[ns](1), float64(1), int64(3), object(15)\n", "memory usage: 470.3+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "1080d034", "metadata": { "id": "1080d034" }, "source": [ "For **Age**, we first need to replace **','** in the values with **'.'**. Otherwise, pandas won't be able to parse it. To do this, we will use the `.replace()` method in pandas, which allows us to replace a letter in a string with another letter. To use `.replace()`, we need to convert the column to a `string`, which can be done using the `.str` attribute. So, this process can be done with `df['Age'].str.replace(',', '.')`." ] }, { "cell_type": "code", "execution_count": null, "id": "b9f334ec", "metadata": { "id": "b9f334ec", "outputId": "3a21d070-0dbf-44d5-d0d3-d000b1306669" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestamp
0155.0MalebLOuSEClothing53KeNtUCkYLGrayWinter3.1YesExpressYesYes14Venmo\\nFortnightlypurchase2023-10-10 08:00:00
1219.0MaleSWEaTerClothing64MaINeLMaroonWinter3.1YesExpressYesYes2Cash\\nFortnightlyview2023-10-11 08:00:00
2350.0MalejeaNSClothing73MASSAcHUsetTsSMaroonSpring3.1YesFree ShippingYesYes23Credit Card\\nWeeklylike2023-10-12 08:00:00
3421.0MalesaNdalsFootwear90RHodE IsLAnDMMaroonSpring3.5YesNext Day AirYesYes49PayPal\\nWeeklyview2023-10-13 08:00:00
4545.0MaleBLoUSEClothing49OReGONMTurquoiseSpring2.7YesFree ShippingYesYes31PayPal\\nAnnuallylike2023-10-14 08:00:00
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "0 1 55.0 Male bLOuSE Clothing 53 \n", "1 2 19.0 Male SWEaTer Clothing 64 \n", "2 3 50.0 Male jeaNS Clothing 73 \n", "3 4 21.0 Male saNdals Footwear 90 \n", "4 5 45.0 Male BLoUSE Clothing 49 \n", "\n", " Location Size Color Season Review Rating Subscription Status \\\n", "0 KeNtUCkY L Gray Winter 3.1 Yes \n", "1 MaINe L Maroon Winter 3.1 Yes \n", "2 MASSAcHUsetTs S Maroon Spring 3.1 Yes \n", "3 RHodE IsLAnD M Maroon Spring 3.5 Yes \n", "4 OReGON M Turquoise Spring 2.7 Yes \n", "\n", " Shipping Type Discount Applied Promo Code Used Previous Purchases \\\n", "0 Express Yes Yes 14 \n", "1 Express Yes Yes 2 \n", "2 Free Shipping Yes Yes 23 \n", "3 Next Day Air Yes Yes 49 \n", "4 Free Shipping Yes Yes 31 \n", "\n", " Payment Method Frequency of Purchases Interaction type timestamp \n", "0 Venmo\\n Fortnightly purchase 2023-10-10 08:00:00 \n", "1 Cash\\n Fortnightly view 2023-10-11 08:00:00 \n", "2 Credit Card\\n Weekly like 2023-10-12 08:00:00 \n", "3 PayPal\\n Weekly view 2023-10-13 08:00:00 \n", "4 PayPal\\n Annually like 2023-10-14 08:00:00 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Replalce ',' with '.' in Age column\n", "df['Age'] = df['Age'].str.replace(',','.')\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "id": "faf17068", "metadata": { "id": "faf17068" }, "source": [ "We can now convert the **Age** column to `int` using the `.astype()` function. This function takes the data type as input, i.e., `df['Age'].astype(float)`.\n", "\n", "In this case, we first have to convert it to `float` and then to `int`." ] }, { "cell_type": "code", "execution_count": null, "id": "b11070cb", "metadata": { "id": "b11070cb" }, "outputs": [], "source": [ "# Conver Age to int type\n", "df['Age'] = df['Age'].astype('float')\n", "df['Age'] = df['Age'].astype('int')" ] }, { "cell_type": "code", "execution_count": null, "id": "728f0cc9", "metadata": { "id": "728f0cc9", "outputId": "6495a39b-2c4b-4e53-c329-91bfb554819c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 3009 entries, 0 to 3008\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Customer ID 3009 non-null int64 \n", " 1 Age 3009 non-null int64 \n", " 2 Gender 3009 non-null object \n", " 3 Item Purchased 3009 non-null object \n", " 4 Category 3009 non-null object \n", " 5 Purchase Amount (USD) 3009 non-null int64 \n", " 6 Location 3009 non-null object \n", " 7 Size 3009 non-null object \n", " 8 Color 3009 non-null object \n", " 9 Season 3009 non-null object \n", " 10 Review Rating 3009 non-null float64 \n", " 11 Subscription Status 3009 non-null object \n", " 12 Shipping Type 3009 non-null object \n", " 13 Discount Applied 3009 non-null object \n", " 14 Promo Code Used 3009 non-null object \n", " 15 Previous Purchases 3009 non-null int64 \n", " 16 Payment Method 3009 non-null object \n", " 17 Frequency of Purchases 3009 non-null object \n", " 18 Interaction type 2881 non-null object \n", " 19 timestamp 3009 non-null datetime64[ns]\n", "dtypes: datetime64[ns](1), float64(1), int64(4), object(14)\n", "memory usage: 470.3+ KB\n" ] } ], "source": [ "# Let's look at the data types after converting timestamp and Age to ensure that it was done properly\n", "df.info()" ] }, { "cell_type": "markdown", "id": "8211b54c", "metadata": { "id": "8211b54c" }, "source": [ "## Text Normalization" ] }, { "cell_type": "markdown", "id": "6157f8f0", "metadata": { "id": "6157f8f0" }, "source": [ "Let's now focus on the **Item Purchased**, **Payment Method**, and **Location** columns. The values in these columns need to be normalized. In the **Payment Method** column, any redundant newline characters (`\\n`) should be removed, and in all columns, all values should be converted to lowercase." ] }, { "cell_type": "code", "execution_count": null, "id": "fd52dd20", "metadata": { "id": "fd52dd20" }, "outputs": [], "source": [ "# Let's first remove \\n in Payment Method column\n", "df['Payment Method'] = df['Payment Method'].str.replace('\\n', '').str.strip()" ] }, { "cell_type": "markdown", "id": "25381a5e", "metadata": { "id": "25381a5e" }, "source": [ "To lowercase the values of a column, we can use the `.lower()` method. Remember that the values should be converted to `string` first using the `.str` function." ] }, { "cell_type": "code", "execution_count": null, "id": "2bac7f9c", "metadata": { "id": "2bac7f9c" }, "outputs": [], "source": [ "# Let's now lowercase values in columns Payment Method, Item Purchased and Location\n", "df['Payment Method'] = df['Payment Method'].str.lower()\n", "df['Item Purchased'] = df['Item Purchased'].str.strip().str.lower()\n", "df['Location'] = df['Location'].str.strip().str.lower()" ] }, { "cell_type": "code", "execution_count": null, "id": "e25be7ac", "metadata": { "id": "e25be7ac", "outputId": "f2f775d6-44aa-44d4-fd5c-98235bbd228c" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestamp
0155MaleblouseClothing53kentuckyLGrayWinter3.1YesExpressYesYes14venmoFortnightlypurchase2023-10-10 08:00:00
1219MalesweaterClothing64maineLMaroonWinter3.1YesExpressYesYes2cashFortnightlyview2023-10-11 08:00:00
2350MalejeansClothing73massachusettsSMaroonSpring3.1YesFree ShippingYesYes23credit cardWeeklylike2023-10-12 08:00:00
3421MalesandalsFootwear90rhode islandMMaroonSpring3.5YesNext Day AirYesYes49paypalWeeklyview2023-10-13 08:00:00
4545MaleblouseClothing49oregonMTurquoiseSpring2.7YesFree ShippingYesYes31paypalAnnuallylike2023-10-14 08:00:00
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "0 1 55 Male blouse Clothing 53 \n", "1 2 19 Male sweater Clothing 64 \n", "2 3 50 Male jeans Clothing 73 \n", "3 4 21 Male sandals Footwear 90 \n", "4 5 45 Male blouse Clothing 49 \n", "\n", " Location Size Color Season Review Rating Subscription Status \\\n", "0 kentucky L Gray Winter 3.1 Yes \n", "1 maine L Maroon Winter 3.1 Yes \n", "2 massachusetts S Maroon Spring 3.1 Yes \n", "3 rhode island M Maroon Spring 3.5 Yes \n", "4 oregon M Turquoise Spring 2.7 Yes \n", "\n", " Shipping Type Discount Applied Promo Code Used Previous Purchases \\\n", "0 Express Yes Yes 14 \n", "1 Express Yes Yes 2 \n", "2 Free Shipping Yes Yes 23 \n", "3 Next Day Air Yes Yes 49 \n", "4 Free Shipping Yes Yes 31 \n", "\n", " Payment Method Frequency of Purchases Interaction type timestamp \n", "0 venmo Fortnightly purchase 2023-10-10 08:00:00 \n", "1 cash Fortnightly view 2023-10-11 08:00:00 \n", "2 credit card Weekly like 2023-10-12 08:00:00 \n", "3 paypal Weekly view 2023-10-13 08:00:00 \n", "4 paypal Annually like 2023-10-14 08:00:00 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at the dataset once again\n", "df.head()" ] }, { "cell_type": "markdown", "id": "d564c5ec", "metadata": { "id": "d564c5ec" }, "source": [ "We can now uppercase the first letter of **Payment Method**, **Item Purchased**, and **Location** columns. To do so, we use the `.apply()` method which allows use to apply a function onto the values of a column.\n", "\n", "Let's first define a function that takes in a string and capitalizes the first letter of that." ] }, { "cell_type": "code", "execution_count": null, "id": "f1300c61", "metadata": { "id": "f1300c61" }, "outputs": [], "source": [ "# Define the uppercase function that takes in a string and capitalizes the first letter of that\n", "def uppercase(string):\n", "\n", " uppercase = string[0].upper() + string[1:]\n", " return uppercase\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a3a770a8", "metadata": { "id": "a3a770a8" }, "outputs": [], "source": [ "# Apply the function to Payment Method, Item Purchased, and Location columns with .apply() method in Pandas\n", "df['Payment Method'] = df['Payment Method'].apply(uppercase)\n", "df['Item Purchased'] = df['Item Purchased'].apply(uppercase)\n", "df['Location'] = df['Location'].apply(uppercase)" ] }, { "cell_type": "code", "execution_count": null, "id": "644736a9", "metadata": { "id": "644736a9", "outputId": "e16f9fa0-3176-4507-de74-3f651abfb386" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestamp
0155MaleBlouseClothing53KentuckyLGrayWinter3.1YesExpressYesYes14VenmoFortnightlypurchase2023-10-10 08:00:00
1219MaleSweaterClothing64MaineLMaroonWinter3.1YesExpressYesYes2CashFortnightlyview2023-10-11 08:00:00
2350MaleJeansClothing73MassachusettsSMaroonSpring3.1YesFree ShippingYesYes23Credit cardWeeklylike2023-10-12 08:00:00
3421MaleSandalsFootwear90Rhode islandMMaroonSpring3.5YesNext Day AirYesYes49PaypalWeeklyview2023-10-13 08:00:00
4545MaleBlouseClothing49OregonMTurquoiseSpring2.7YesFree ShippingYesYes31PaypalAnnuallylike2023-10-14 08:00:00
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "0 1 55 Male Blouse Clothing 53 \n", "1 2 19 Male Sweater Clothing 64 \n", "2 3 50 Male Jeans Clothing 73 \n", "3 4 21 Male Sandals Footwear 90 \n", "4 5 45 Male Blouse Clothing 49 \n", "\n", " Location Size Color Season Review Rating Subscription Status \\\n", "0 Kentucky L Gray Winter 3.1 Yes \n", "1 Maine L Maroon Winter 3.1 Yes \n", "2 Massachusetts S Maroon Spring 3.1 Yes \n", "3 Rhode island M Maroon Spring 3.5 Yes \n", "4 Oregon M Turquoise Spring 2.7 Yes \n", "\n", " Shipping Type Discount Applied Promo Code Used Previous Purchases \\\n", "0 Express Yes Yes 14 \n", "1 Express Yes Yes 2 \n", "2 Free Shipping Yes Yes 23 \n", "3 Next Day Air Yes Yes 49 \n", "4 Free Shipping Yes Yes 31 \n", "\n", " Payment Method Frequency of Purchases Interaction type timestamp \n", "0 Venmo Fortnightly purchase 2023-10-10 08:00:00 \n", "1 Cash Fortnightly view 2023-10-11 08:00:00 \n", "2 Credit card Weekly like 2023-10-12 08:00:00 \n", "3 Paypal Weekly view 2023-10-13 08:00:00 \n", "4 Paypal Annually like 2023-10-14 08:00:00 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "2ea70b88", "metadata": { "id": "2ea70b88" }, "source": [ "As we can see, all of the columns are processed and ready. We can now look into some further cleaning by inspecting missing values and duplicates in the dataset." ] }, { "cell_type": "markdown", "id": "d4e4a321", "metadata": { "id": "d4e4a321" }, "source": [ "## Missing Values" ] }, { "cell_type": "markdown", "id": "284e573a", "metadata": { "id": "284e573a" }, "source": [ "Let's now look into the missing values in all columns." ] }, { "cell_type": "code", "execution_count": null, "id": "717d316a", "metadata": { "id": "717d316a", "outputId": "66167b3c-e8e5-4b09-c0bd-5b6126b6ae6c" }, "outputs": [ { "data": { "text/plain": [ "Customer ID 0.000000\n", "Age 0.000000\n", "Gender 0.000000\n", "Item Purchased 0.000000\n", "Category 0.000000\n", "Purchase Amount (USD) 0.000000\n", "Location 0.000000\n", "Size 0.000000\n", "Color 0.000000\n", "Season 0.000000\n", "Review Rating 0.000000\n", "Subscription Status 0.000000\n", "Shipping Type 0.000000\n", "Discount Applied 0.000000\n", "Promo Code Used 0.000000\n", "Previous Purchases 0.000000\n", "Payment Method 0.000000\n", "Frequency of Purchases 0.000000\n", "Interaction type 4.253905\n", "timestamp 0.000000\n", "dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's calculate the ratio of missing values in each column\n", "(df.isna().sum() / df.shape[0]) * 100" ] }, { "cell_type": "markdown", "id": "7af34c0e", "metadata": { "id": "7af34c0e" }, "source": [ "As we can see, we have missing values in the **Interaction type** column, which account for only 4.25% of the dataset. This is negligible, and we can remove them from our dataset as it will not impact our results significantly." ] }, { "cell_type": "code", "execution_count": null, "id": "39fc747b", "metadata": { "id": "39fc747b" }, "outputs": [], "source": [ "# Let's drop the missing values\n", "df.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "4086bd3c", "metadata": { "id": "4086bd3c", "outputId": "32e3f1d7-415f-4bc4-8b0a-1e43523cdc6e" }, "outputs": [ { "data": { "text/plain": [ "Customer ID 0\n", "Age 0\n", "Gender 0\n", "Item Purchased 0\n", "Category 0\n", "Purchase Amount (USD) 0\n", "Location 0\n", "Size 0\n", "Color 0\n", "Season 0\n", "Review Rating 0\n", "Subscription Status 0\n", "Shipping Type 0\n", "Discount Applied 0\n", "Promo Code Used 0\n", "Previous Purchases 0\n", "Payment Method 0\n", "Frequency of Purchases 0\n", "Interaction type 0\n", "timestamp 0\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's inspect the dataset after dropping missing values\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "id": "60537b2e", "metadata": { "id": "60537b2e" }, "source": [ "## Duplicates" ] }, { "cell_type": "markdown", "id": "b0453c50", "metadata": { "id": "b0453c50" }, "source": [ "Having addressed the missing values and adjusted the data types, we can now proceed to check for duplicated rows within the dataset to ensure there are no redundancies." ] }, { "cell_type": "code", "execution_count": null, "id": "1f14494b", "metadata": { "id": "1f14494b", "outputId": "838e3dbd-d9cb-407d-f26b-a525fed0744a" }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's calculate how many duplicated rows we have in the dataset\n", "df.duplicated().sum()" ] }, { "cell_type": "markdown", "id": "b15169c9", "metadata": { "id": "b15169c9" }, "source": [ "Given that there are 10 duplicated rows in the dataset, these may indicate suspicious records, such as potential double charges from the vendor for the same purchase. Although analyzing these could be complex and would require a detailed review to confirm their nature, we can tentatively assume these might be system errors." ] }, { "cell_type": "markdown", "id": "d4bb6778", "metadata": { "id": "d4bb6778" }, "source": [ "Before proceeding to remove any missing values, let's first examine these duplicated rows to understand their characteristics and ensure that removing them is the correct action." ] }, { "cell_type": "code", "execution_count": null, "id": "9327026b", "metadata": { "id": "9327026b", "outputId": "ff910d90-1772-4b5d-d8d2-e16c82f6998c", "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestamp
33033146MaleSweaterClothing51North dakotaMBeigeWinter3.7YesExpressYesYes43VenmoBi-Weeklyview2023-05-17 10:00:00
300733146MaleSweaterClothing51North dakotaMBeigeWinter3.7YesExpressYesYes43VenmoBi-Weeklyview2023-05-17 10:00:00
46646738MaleJeansClothing24VermontLTurquoiseWinter4.6YesFree ShippingYesYes25PaypalWeeklyview2023-09-30 10:00:00
300546738MaleJeansClothing24VermontLTurquoiseWinter4.6YesFree ShippingYesYes25PaypalWeeklyview2023-09-30 10:00:00
86086126MaleShoesFootwear87South carolinaMMaroonSpring4.1YesStandardYesYes3CashMonthlyview2022-12-28 10:00:00
299986126MaleShoesFootwear87South carolinaMMaroonSpring4.1YesStandardYesYes3CashMonthlyview2022-12-28 10:00:00
1095109619MaleCoatOuterwear34TennesseeSPeachSummer3.0NoFree ShippingYesYes30Bank transferQuarterlylike2023-12-18 08:00:00
3002109619MaleCoatOuterwear34TennesseeSPeachSummer3.0NoFree ShippingYesYes30Bank transferQuarterlylike2023-12-18 08:00:00
1130113167MaleJewelryAccessories41KentuckyMLavenderSummer3.9NoStandardYesYes30PaypalBi-Weeklypurchase2023-01-20 10:00:00
3001113167MaleJewelryAccessories41KentuckyMLavenderSummer3.9NoStandardYesYes30PaypalBi-Weeklypurchase2023-01-20 10:00:00
1238123955MalePantsClothing77WyomingMVioletWinter3.7NoExpressYesYes3VenmoFortnightlyview2023-01-19 10:00:00
3006123955MalePantsClothing77WyomingMVioletWinter3.7NoExpressYesYes3VenmoFortnightlyview2023-01-19 10:00:00
1294129527MaleShirtClothing64TexasMSilverSummer4.5NoStandardYesYes21CashEvery 3 Monthspurchase2023-11-01 08:00:00
3000129527MaleShirtClothing64TexasMSilverSummer4.5NoStandardYesYes21CashEvery 3 Monthspurchase2023-11-01 08:00:00
3008148328MaleGlovesAccessories67WisconsinSMagentaSpring4.1No2-Day ShippingYesYes43Bank transferWeeklylike2023-01-17 10:00:00
1482148328MaleGlovesAccessories67WisconsinSMagentaSpring4.1No2-Day ShippingYesYes43Bank transferWeeklylike2023-01-17 10:00:00
1638163968MaleJacketOuterwear51TennesseeMBlackSpring4.6NoFree ShippingYesYes18VenmoWeeklyview2023-11-09 08:00:00
3003163968MaleJacketOuterwear51TennesseeMBlackSpring4.6NoFree ShippingYesYes18VenmoWeeklyview2023-11-09 08:00:00
3004217031MaleCoatOuterwear42FloridaMBrownFall2.9NoNext Day AirNoNo32Bank transferBi-Weeklylike2023-03-30 10:00:00
2169217031MaleCoatOuterwear42FloridaMBrownFall2.9NoNext Day AirNoNo32Bank transferBi-Weeklylike2023-03-30 10:00:00
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category \\\n", "330 331 46 Male Sweater Clothing \n", "3007 331 46 Male Sweater Clothing \n", "466 467 38 Male Jeans Clothing \n", "3005 467 38 Male Jeans Clothing \n", "860 861 26 Male Shoes Footwear \n", "2999 861 26 Male Shoes Footwear \n", "1095 1096 19 Male Coat Outerwear \n", "3002 1096 19 Male Coat Outerwear \n", "1130 1131 67 Male Jewelry Accessories \n", "3001 1131 67 Male Jewelry Accessories \n", "1238 1239 55 Male Pants Clothing \n", "3006 1239 55 Male Pants Clothing \n", "1294 1295 27 Male Shirt Clothing \n", "3000 1295 27 Male Shirt Clothing \n", "3008 1483 28 Male Gloves Accessories \n", "1482 1483 28 Male Gloves Accessories \n", "1638 1639 68 Male Jacket Outerwear \n", "3003 1639 68 Male Jacket Outerwear \n", "3004 2170 31 Male Coat Outerwear \n", "2169 2170 31 Male Coat Outerwear \n", "\n", " Purchase Amount (USD) Location Size Color Season \\\n", "330 51 North dakota M Beige Winter \n", "3007 51 North dakota M Beige Winter \n", "466 24 Vermont L Turquoise Winter \n", "3005 24 Vermont L Turquoise Winter \n", "860 87 South carolina M Maroon Spring \n", "2999 87 South carolina M Maroon Spring \n", "1095 34 Tennessee S Peach Summer \n", "3002 34 Tennessee S Peach Summer \n", "1130 41 Kentucky M Lavender Summer \n", "3001 41 Kentucky M Lavender Summer \n", "1238 77 Wyoming M Violet Winter \n", "3006 77 Wyoming M Violet Winter \n", "1294 64 Texas M Silver Summer \n", "3000 64 Texas M Silver Summer \n", "3008 67 Wisconsin S Magenta Spring \n", "1482 67 Wisconsin S Magenta Spring \n", "1638 51 Tennessee M Black Spring \n", "3003 51 Tennessee M Black Spring \n", "3004 42 Florida M Brown Fall \n", "2169 42 Florida M Brown Fall \n", "\n", " Review Rating Subscription Status Shipping Type Discount Applied \\\n", "330 3.7 Yes Express Yes \n", "3007 3.7 Yes Express Yes \n", "466 4.6 Yes Free Shipping Yes \n", "3005 4.6 Yes Free Shipping Yes \n", "860 4.1 Yes Standard Yes \n", "2999 4.1 Yes Standard Yes \n", "1095 3.0 No Free Shipping Yes \n", "3002 3.0 No Free Shipping Yes \n", "1130 3.9 No Standard Yes \n", "3001 3.9 No Standard Yes \n", "1238 3.7 No Express Yes \n", "3006 3.7 No Express Yes \n", "1294 4.5 No Standard Yes \n", "3000 4.5 No Standard Yes \n", "3008 4.1 No 2-Day Shipping Yes \n", "1482 4.1 No 2-Day Shipping Yes \n", "1638 4.6 No Free Shipping Yes \n", "3003 4.6 No Free Shipping Yes \n", "3004 2.9 No Next Day Air No \n", "2169 2.9 No Next Day Air No \n", "\n", " Promo Code Used Previous Purchases Payment Method \\\n", "330 Yes 43 Venmo \n", "3007 Yes 43 Venmo \n", "466 Yes 25 Paypal \n", "3005 Yes 25 Paypal \n", "860 Yes 3 Cash \n", "2999 Yes 3 Cash \n", "1095 Yes 30 Bank transfer \n", "3002 Yes 30 Bank transfer \n", "1130 Yes 30 Paypal \n", "3001 Yes 30 Paypal \n", "1238 Yes 3 Venmo \n", "3006 Yes 3 Venmo \n", "1294 Yes 21 Cash \n", "3000 Yes 21 Cash \n", "3008 Yes 43 Bank transfer \n", "1482 Yes 43 Bank transfer \n", "1638 Yes 18 Venmo \n", "3003 Yes 18 Venmo \n", "3004 No 32 Bank transfer \n", "2169 No 32 Bank transfer \n", "\n", " Frequency of Purchases Interaction type timestamp \n", "330 Bi-Weekly view 2023-05-17 10:00:00 \n", "3007 Bi-Weekly view 2023-05-17 10:00:00 \n", "466 Weekly view 2023-09-30 10:00:00 \n", "3005 Weekly view 2023-09-30 10:00:00 \n", "860 Monthly view 2022-12-28 10:00:00 \n", "2999 Monthly view 2022-12-28 10:00:00 \n", "1095 Quarterly like 2023-12-18 08:00:00 \n", "3002 Quarterly like 2023-12-18 08:00:00 \n", "1130 Bi-Weekly purchase 2023-01-20 10:00:00 \n", "3001 Bi-Weekly purchase 2023-01-20 10:00:00 \n", "1238 Fortnightly view 2023-01-19 10:00:00 \n", "3006 Fortnightly view 2023-01-19 10:00:00 \n", "1294 Every 3 Months purchase 2023-11-01 08:00:00 \n", "3000 Every 3 Months purchase 2023-11-01 08:00:00 \n", "3008 Weekly like 2023-01-17 10:00:00 \n", "1482 Weekly like 2023-01-17 10:00:00 \n", "1638 Weekly view 2023-11-09 08:00:00 \n", "3003 Weekly view 2023-11-09 08:00:00 \n", "3004 Bi-Weekly like 2023-03-30 10:00:00 \n", "2169 Bi-Weekly like 2023-03-30 10:00:00 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's filter out the duplicated rows and inspect them\n", "df[df.duplicated(keep=False)].sort_values('Customer ID')" ] }, { "cell_type": "code", "execution_count": null, "id": "6ce07322", "metadata": { "id": "6ce07322" }, "outputs": [], "source": [ "# Let's remove the duplicated rows from the dataset\n", "df.drop_duplicates(inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "e2218fd3", "metadata": { "id": "e2218fd3", "outputId": "41a7c6ac-003c-4a8d-bd00-ef5d47ee3440" }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspecting the dataset after dropping duplicated rows\n", "df.duplicated().sum()" ] }, { "cell_type": "markdown", "id": "9a4766e1", "metadata": { "id": "9a4766e1" }, "source": [ "We've successfully completed the cleaning process for our dataset. The final step we can take is to add some new columns for our analysis and to reindex the data using the **Customer ID** column and sort the dataset based on this column to ensure that the data is organized properly. This will make further analysis more intuitive and accessible." ] }, { "cell_type": "markdown", "id": "95683a11", "metadata": { "id": "95683a11" }, "source": [ "## Adding Month and Weekday Name" ] }, { "cell_type": "markdown", "id": "90597a17", "metadata": { "id": "90597a17" }, "source": [ "To analyze the data by month, we'll need to add a column that specifies the month and another that indicates the weekday name for each transaction. This will enable us to conduct more detailed temporal analysis and understand patterns based on the time of week and month.\n", "\n", "To do this, we can use the `.dt.month_name()` and `.dt.day_name()` methods in Pandas. These functions allow us to extract the month name and weekday name from a `datetime` column." ] }, { "cell_type": "code", "execution_count": null, "id": "d57f8e6c", "metadata": { "id": "d57f8e6c", "outputId": "cf424741-4cb1-42d6-bb5f-63790fb022ce" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Customer IDAgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestampMonth_NameWeekday_Name
0155MaleBlouseClothing53KentuckyLGrayWinter3.1YesExpressYesYes14VenmoFortnightlypurchase2023-10-10 08:00:00OctoberTuesday
1219MaleSweaterClothing64MaineLMaroonWinter3.1YesExpressYesYes2CashFortnightlyview2023-10-11 08:00:00OctoberWednesday
2350MaleJeansClothing73MassachusettsSMaroonSpring3.1YesFree ShippingYesYes23Credit cardWeeklylike2023-10-12 08:00:00OctoberThursday
3421MaleSandalsFootwear90Rhode islandMMaroonSpring3.5YesNext Day AirYesYes49PaypalWeeklyview2023-10-13 08:00:00OctoberFriday
4545MaleBlouseClothing49OregonMTurquoiseSpring2.7YesFree ShippingYesYes31PaypalAnnuallylike2023-10-14 08:00:00OctoberSaturday
\n", "
" ], "text/plain": [ " Customer ID Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "0 1 55 Male Blouse Clothing 53 \n", "1 2 19 Male Sweater Clothing 64 \n", "2 3 50 Male Jeans Clothing 73 \n", "3 4 21 Male Sandals Footwear 90 \n", "4 5 45 Male Blouse Clothing 49 \n", "\n", " Location Size Color Season Review Rating Subscription Status \\\n", "0 Kentucky L Gray Winter 3.1 Yes \n", "1 Maine L Maroon Winter 3.1 Yes \n", "2 Massachusetts S Maroon Spring 3.1 Yes \n", "3 Rhode island M Maroon Spring 3.5 Yes \n", "4 Oregon M Turquoise Spring 2.7 Yes \n", "\n", " Shipping Type Discount Applied Promo Code Used Previous Purchases \\\n", "0 Express Yes Yes 14 \n", "1 Express Yes Yes 2 \n", "2 Free Shipping Yes Yes 23 \n", "3 Next Day Air Yes Yes 49 \n", "4 Free Shipping Yes Yes 31 \n", "\n", " Payment Method Frequency of Purchases Interaction type timestamp \\\n", "0 Venmo Fortnightly purchase 2023-10-10 08:00:00 \n", "1 Cash Fortnightly view 2023-10-11 08:00:00 \n", "2 Credit card Weekly like 2023-10-12 08:00:00 \n", "3 Paypal Weekly view 2023-10-13 08:00:00 \n", "4 Paypal Annually like 2023-10-14 08:00:00 \n", "\n", " Month_Name Weekday_Name \n", "0 October Tuesday \n", "1 October Wednesday \n", "2 October Thursday \n", "3 October Friday \n", "4 October Saturday " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's add the month and week name columns\n", "df['Month_Name'] = df['timestamp'].dt.month_name()\n", "df['Weekday_Name'] = df['timestamp'].dt.day_name()\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "id": "88b5aa94", "metadata": { "id": "88b5aa94" }, "source": [ "## Resetting the Index and Sorting the Dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "948da9fa", "metadata": { "id": "948da9fa", "outputId": "ef935a3e-c07d-4051-9e3c-9ee80d384e6a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestampMonth_NameWeekday_Name
Customer ID
155MaleBlouseClothing53KentuckyLGrayWinter3.1YesExpressYesYes14VenmoFortnightlypurchase2023-10-10 08:00:00OctoberTuesday
219MaleSweaterClothing64MaineLMaroonWinter3.1YesExpressYesYes2CashFortnightlyview2023-10-11 08:00:00OctoberWednesday
350MaleJeansClothing73MassachusettsSMaroonSpring3.1YesFree ShippingYesYes23Credit cardWeeklylike2023-10-12 08:00:00OctoberThursday
421MaleSandalsFootwear90Rhode islandMMaroonSpring3.5YesNext Day AirYesYes49PaypalWeeklyview2023-10-13 08:00:00OctoberFriday
545MaleBlouseClothing49OregonMTurquoiseSpring2.7YesFree ShippingYesYes31PaypalAnnuallylike2023-10-14 08:00:00OctoberSaturday
\n", "
" ], "text/plain": [ " Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "Customer ID \n", "1 55 Male Blouse Clothing 53 \n", "2 19 Male Sweater Clothing 64 \n", "3 50 Male Jeans Clothing 73 \n", "4 21 Male Sandals Footwear 90 \n", "5 45 Male Blouse Clothing 49 \n", "\n", " Location Size Color Season Review Rating \\\n", "Customer ID \n", "1 Kentucky L Gray Winter 3.1 \n", "2 Maine L Maroon Winter 3.1 \n", "3 Massachusetts S Maroon Spring 3.1 \n", "4 Rhode island M Maroon Spring 3.5 \n", "5 Oregon M Turquoise Spring 2.7 \n", "\n", " Subscription Status Shipping Type Discount Applied \\\n", "Customer ID \n", "1 Yes Express Yes \n", "2 Yes Express Yes \n", "3 Yes Free Shipping Yes \n", "4 Yes Next Day Air Yes \n", "5 Yes Free Shipping Yes \n", "\n", " Promo Code Used Previous Purchases Payment Method \\\n", "Customer ID \n", "1 Yes 14 Venmo \n", "2 Yes 2 Cash \n", "3 Yes 23 Credit card \n", "4 Yes 49 Paypal \n", "5 Yes 31 Paypal \n", "\n", " Frequency of Purchases Interaction type timestamp \\\n", "Customer ID \n", "1 Fortnightly purchase 2023-10-10 08:00:00 \n", "2 Fortnightly view 2023-10-11 08:00:00 \n", "3 Weekly like 2023-10-12 08:00:00 \n", "4 Weekly view 2023-10-13 08:00:00 \n", "5 Annually like 2023-10-14 08:00:00 \n", "\n", " Month_Name Weekday_Name \n", "Customer ID \n", "1 October Tuesday \n", "2 October Wednesday \n", "3 October Thursday \n", "4 October Friday \n", "5 October Saturday " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's set the Customer ID column as the index\n", "df.set_index('Customer ID', inplace=True, drop=True)\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "9f880a15", "metadata": { "id": "9f880a15", "outputId": "74444d38-86b6-453a-ba41-3129111ec3aa" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeGenderItem PurchasedCategoryPurchase Amount (USD)LocationSizeColorSeasonReview RatingSubscription StatusShipping TypeDiscount AppliedPromo Code UsedPrevious PurchasesPayment MethodFrequency of PurchasesInteraction typetimestampMonth_NameWeekday_Name
Customer ID
155MaleBlouseClothing53KentuckyLGrayWinter3.1YesExpressYesYes14VenmoFortnightlypurchase2023-10-10 08:00:00OctoberTuesday
219MaleSweaterClothing64MaineLMaroonWinter3.1YesExpressYesYes2CashFortnightlyview2023-10-11 08:00:00OctoberWednesday
350MaleJeansClothing73MassachusettsSMaroonSpring3.1YesFree ShippingYesYes23Credit cardWeeklylike2023-10-12 08:00:00OctoberThursday
421MaleSandalsFootwear90Rhode islandMMaroonSpring3.5YesNext Day AirYesYes49PaypalWeeklyview2023-10-13 08:00:00OctoberFriday
545MaleBlouseClothing49OregonMTurquoiseSpring2.7YesFree ShippingYesYes31PaypalAnnuallylike2023-10-14 08:00:00OctoberSaturday
\n", "
" ], "text/plain": [ " Age Gender Item Purchased Category Purchase Amount (USD) \\\n", "Customer ID \n", "1 55 Male Blouse Clothing 53 \n", "2 19 Male Sweater Clothing 64 \n", "3 50 Male Jeans Clothing 73 \n", "4 21 Male Sandals Footwear 90 \n", "5 45 Male Blouse Clothing 49 \n", "\n", " Location Size Color Season Review Rating \\\n", "Customer ID \n", "1 Kentucky L Gray Winter 3.1 \n", "2 Maine L Maroon Winter 3.1 \n", "3 Massachusetts S Maroon Spring 3.1 \n", "4 Rhode island M Maroon Spring 3.5 \n", "5 Oregon M Turquoise Spring 2.7 \n", "\n", " Subscription Status Shipping Type Discount Applied \\\n", "Customer ID \n", "1 Yes Express Yes \n", "2 Yes Express Yes \n", "3 Yes Free Shipping Yes \n", "4 Yes Next Day Air Yes \n", "5 Yes Free Shipping Yes \n", "\n", " Promo Code Used Previous Purchases Payment Method \\\n", "Customer ID \n", "1 Yes 14 Venmo \n", "2 Yes 2 Cash \n", "3 Yes 23 Credit card \n", "4 Yes 49 Paypal \n", "5 Yes 31 Paypal \n", "\n", " Frequency of Purchases Interaction type timestamp \\\n", "Customer ID \n", "1 Fortnightly purchase 2023-10-10 08:00:00 \n", "2 Fortnightly view 2023-10-11 08:00:00 \n", "3 Weekly like 2023-10-12 08:00:00 \n", "4 Weekly view 2023-10-13 08:00:00 \n", "5 Annually like 2023-10-14 08:00:00 \n", "\n", " Month_Name Weekday_Name \n", "Customer ID \n", "1 October Tuesday \n", "2 October Wednesday \n", "3 October Thursday \n", "4 October Friday \n", "5 October Saturday " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's sort the dataframe based on the index in an ascending order\n", "df.sort_index(inplace=True)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "id": "ffb1eb00", "metadata": { "id": "ffb1eb00" }, "source": [ "We can now save the cleaned version of the dataset to use it in our next lecture for visualizations." ] }, { "cell_type": "code", "execution_count": null, "id": "cafd2ea7", "metadata": { "id": "cafd2ea7" }, "outputs": [], "source": [ "df.to_csv('/Users/amirrahimizadeh/Downloads/cleaned-e-commerce.csv', index= True)" ] }, { "cell_type": "markdown", "id": "8f15ccb8", "metadata": { "id": "8f15ccb8" }, "source": [ "# Conclusions and Next Steps" ] }, { "cell_type": "markdown", "id": "8d1f23df", "metadata": { "id": "8d1f23df" }, "source": [ "In the first part of the Python Project Sprint, we went through various cleaning steps and thoroughly cleaned the dataset by adjusting data types, removing duplicates and missing values, normalizing the text columns, and removing redundant columns.\n", "\n", "In our next lecture, we will first combine all of the cleaning steps into a function that automates this process, streamlining our cleaning workflow so we can seamlessly clean any new data we receive. Then, we will visualize the data and extract insights and patterns to answer the project goal." ] }, { "cell_type": "markdown", "id": "b6fada3b", "metadata": { "id": "b6fada3b" }, "source": [ "
\n", "
\n", "
\n", "
" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 5 }