Monday, 13 January 2025

What is Data Reduction in data science? (What, why and how)










 


Wednesday, 8 January 2025

How to import, locate, and load a dataset into a pandas dataframe, and how to preprocess, format, and normalize the data in data science.


Step 1: Import and install all dependencies.



Step 2: Search for dataset on kaggle.

 

 

Step 3: Click on All Datasets. It will show all the datasets available there.

 

 

Step 4: Click on the dataset you want to analyze. On the page that opens, click the Download button. You will get the code that gives the dataset directory path.

 

 

Step 5: Below is the code that shows the path of the dataset directory, as mentioned in Step 4. Here we are locating the open-source data on the web.

 

 

Step 6: Here we store the path in the dataset_dir variable. We then join that path with the data.csv file name, store the result in the data_file variable, and display the complete path of the dataset.

 

 

Step 7: Now we load the dataset into a pandas data frame df by passing that path to the read_csv(data_file) function. After that we display the first five rows of the dataset.
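Steps 6 and 7 can be sketched roughly as follows. This is only a minimal sketch: the real Kaggle directory path is not reproduced here, so dataset_dir is a temporary directory and the small data.csv written into it is hypothetical sample data created purely for illustration.

```python
import os
import tempfile

import pandas as pd

# Assumption: a stand-in for the directory path returned by the Kaggle download code
dataset_dir = tempfile.mkdtemp()

# Hypothetical sample file so the sketch runs without downloading anything
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "score": [10.5, 20.0, 15.25, 30.0, 12.0, 18.5],
})
sample.to_csv(os.path.join(dataset_dir, "data.csv"), index=False)

# Step 6: join the directory path with the file name and display it
data_file = os.path.join(dataset_dir, "data.csv")
print(data_file)

# Step 7: load the CSV into a data frame and show the first five rows
df = pd.read_csv(data_file)
print(df.head())
```

With a real download, only the assignment to dataset_dir would change; the join and read_csv calls stay the same.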

 

 

Step 8: Check for missing values in the dataset.


Step 9: Summarize the missing values. If the dataset has thousands of rows, we cannot check them row by row, so we summarize the missing values per column as follows.
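A minimal sketch of Steps 8 and 9, assuming a small hypothetical data frame (the column names and values are invented for illustration). isnull() marks each missing cell, and isnull().sum() gives one count per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Mumbai", None, "Nashik"],
})

# Step 8: True marks a missing cell
print(df.isnull())

# Step 9: number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```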


Step 10: To get some initial statistics of the dataset, we use the describe function.
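As a rough illustration of Step 10, the sketch below calls describe() on a hypothetical data frame with two numeric columns. The result is itself a data frame indexed by statistic name (count, mean, std, min, and so on).

```python
import pandas as pd

# Hypothetical numeric data
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 70, 80],
})

# Summary statistics for every numeric column
stats = df.describe()
print(stats)
```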


Step 11: Describe the variables and their types. Summarize the variable types by checking the data type of each column in the dataset.



Step 12: Check the dimensions of the data frame. This gives the number of rows and columns in the data frame.
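Steps 11 and 12 might look like the sketch below, again on a hypothetical data frame. dtypes lists the data type of each column, and shape is a (rows, columns) tuple.

```python
import pandas as pd

# Hypothetical data with mixed column types
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "age": [21, 34, 29],
    "salary": [30000.0, 45000.5, 52000.0],
})

# Step 11: data type of each column
print(df.dtypes)

# Step 12: number of rows and columns
rows, cols = df.shape
print("Rows:", rows, "Columns:", cols)
```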

 
 
Step 13: If variables are not in the correct data type, apply the proper type conversion.
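One possible way to do the conversion in Step 13 is sketched below. The columns are hypothetical: both start out as text, and astype and pd.to_datetime convert them to an integer and a datetime type.

```python
import pandas as pd

# Hypothetical data where numbers and dates were read in as text
df = pd.DataFrame({
    "age": ["21", "34", "29"],
    "joined": ["2024-01-05", "2024-03-17", "2024-11-30"],
})

# Convert the text column to integers
df["age"] = df["age"].astype(int)

# Convert the text column to datetime values
df["joined"] = pd.to_datetime(df["joined"])

print(df.dtypes)
```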
 
 
 
Step 14: How to add a new column to an existing data frame, how to find the length of each column, and how to fill the newly added column with random values.
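Step 14 could be sketched as follows. The data frame, the column name random_score, and the value range 0 to 99 are assumptions made for illustration; NumPy's random generator supplies one random integer per row.

```python
import numpy as np
import pandas as pd

# Hypothetical existing data frame
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena", "Kiran"]})

# Length of a single column, and non-null count of every column
print(len(df["name"]))
print(df.count())

# Add a new column filled with random integers in the range 0..99
rng = np.random.default_rng(seed=42)
df["random_score"] = rng.integers(low=0, high=100, size=len(df))
print(df)
```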

 
 
 
Step 15: How to replace the column values with 0 and 1.

 
 
Step 16: How to count the number of 0s and 1s in the respective column.
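A minimal sketch of Steps 15 and 16, assuming a hypothetical yes/no column. map replaces each label with 0 or 1, and value_counts then counts how often each value occurs.

```python
import pandas as pd

# Hypothetical column with two text labels
df = pd.DataFrame({"passed": ["yes", "no", "yes", "yes", "no"]})

# Step 15: replace the labels with 0 and 1
df["passed"] = df["passed"].map({"no": 0, "yes": 1})

# Step 16: count the number of 0s and 1s
counts = df["passed"].value_counts()
print(counts)
```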

 
 
Step 17: Turn categorical variables into quantitative variables in Python.
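Two common ways to do Step 17 are sketched below on a hypothetical color column. The first assigns one integer code per category; the second creates one indicator column per category (one-hot encoding). Which one suits a given dataset depends on the model and on whether the categories have a natural order.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Option 1: one integer code per category
# (categories are sorted, so blue=0, green=1, red=2)
df["color_code"] = df["color"].astype("category").cat.codes

# Option 2: one indicator column per category
dummies = pd.get_dummies(df["color"], prefix="color")

print(df)
print(dummies)
```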

Friday, 3 January 2025

What is data explosion, and what are its implications? Also explain the 5 V's of Big Data.








 





Wednesday, 1 January 2025

Telephone Book Assignment using Hash Table Implementation (Chaining/ Open Addressing).


Pre-Requisite: 

Before programming in C++, we must decide which IDE to use. Follow these few steps to run your code:

1. Install Dev-C++

  1. Download Dev-C++:

    • Open your browser and visit the official Embarcadero Dev-C++ website, or use a trusted source like SourceForge.
    • Download and install the IDE.
  2. Ensure the installer includes the MinGW (GCC) Compiler:

    • Dev-C++ often comes bundled with MinGW, which supports the STL.

2. Configure Dev-C++

  1. Launch the IDE after installation.

  2. Set the Compiler:

    • Go to Tools > Compiler Options.
    • Check the compiler version and ensure it is modern (preferably GCC 9 or higher for full STL and modern C++ standard support).
  3. Set the C++ Standard (Optional but Recommended):

    • Under Compiler Options, add the following flag in the "Add the following commands when calling the compiler" box:
    • -std=c++17

3. Create a New Project

  1. Go to File > New > Project.
  2. Select "Console Application" and set the language to "C++."
  3. Save the project in your desired location.


Let's Start Implementation:

Problem Statement: 

Consider a telephone book database of n clients. Make use of a hash table implementation to quickly look up a client's telephone number. Make use of two collision handling techniques and compare them using the number of comparisons required to find a set of telephone numbers.

Algorithm:

Step 1: Start
Step 2: Define a Hash Table class with the following components:
        Step 2.1: For chaining, use an array of linked lists to handle collisions.
        Step 2.2: For open addressing, use an array to store keys directly, along with a status array to track filled slots.

Step 3: Hash Function:
        Step 3.1: Use a simple modulo operation to calculate the hash index.

Step 4: Insert a client's phone number:
        Step 4.1: Compute the hash index using the hash function.
        Step 4.2: Handle collisions using the chosen technique.

Step 5: Search for a client's phone number:
        Step 5.1: Compute the hash index and probe the table until the number is found or confirmed absent.

Step 6: Compare the techniques:
        Step 6.1: Measure the number of comparisons for both techniques when looking up a set of numbers.
Step 7: Stop.


Code 1:

1. Hash Table with Chaining:

#include <iostream>
#include <cstring> // for memset
#include <string>  // for std::string
using namespace std;

#define TABLE_SIZE 10
#define MAX_CHAIN_SIZE 5 // Maximum number of entries in a chain

class HashTable {
private:
    string names[TABLE_SIZE][MAX_CHAIN_SIZE];  // 2D array for names
    string phones[TABLE_SIZE][MAX_CHAIN_SIZE]; // 2D array for phone numbers
    int chainSize[TABLE_SIZE];               // To track the number of entries in each bucket

    // Hash function
    int hashFunction(const string& key) {
        int hash = 0;
        for (char c : key) {
            hash += c;
        }
        return hash % TABLE_SIZE;
    }

public:
    // Constructor
    HashTable() {
        memset(chainSize, 0, sizeof(chainSize));
    }

    // Insert function
    void insert(const string& name, const string& phone) {
        int index = hashFunction(name);
        if (chainSize[index] < MAX_CHAIN_SIZE) {
            names[index][chainSize[index]] = name;
            phones[index][chainSize[index]] = phone;
            chainSize[index]++;
        } else {
            cout << "Error: Bucket overflow! Cannot insert " << name << ".\n";
        }
    }

    // Search function with a static variable to count comparisons
    string search(const string& name) {
        static int totalComparisons = 0; // Running total across all calls
        int comparisons = 0;             // Comparisons made in this call

        int index = hashFunction(name);
        for (int i = 0; i < chainSize[index]; ++i) {
            comparisons++;
            totalComparisons++; // Increment the static variable
            if (names[index][i] == name) {
                //cout << "Comparisons for this search: " << comparisons << endl;
                cout << "Total comparisons so far: " << totalComparisons << endl;
                return phones[index][i];
            }
        }

        //cout << "Comparisons for this search: " << comparisons << endl;
        cout << "Total comparisons so far: " << totalComparisons << endl;
        return "Not Found";
    }

    // Display function
    void display() {
        cout << "Hash Table Contents:\n";
        for (int i = 0; i < TABLE_SIZE; ++i) {
            cout << "Bucket " << i << ": ";
            if (chainSize[i] == 0) {
                cout << "Empty\n";
            } else {
                for (int j = 0; j < chainSize[i]; ++j) {
                    cout << "[" << names[i][j] << ": " << phones[i][j] << "] ";
                }
                cout << "\n";
            }
        }
    }


};

int main() {
    HashTable hashTable;

    int n;
    cout << "Enter the number of clients: ";
    cin >> n;

    // Insert data into the hash table
    for (int i = 0; i < n; ++i) {
        string name, phone;
        cout << "Enter name of client " << i + 1 << ": ";
        cin >> name;
        cout << "Enter phone number of client " << i + 1 << ": ";
        cin >> phone;
        hashTable.insert(name, phone);
    }
    cout << "\n";
    cout << "\n Let's Search name from Hash Table:\n";
    // Search for a key
    string searchName;
    cout << "Enter the name to search for: ";
    cin >> searchName;
    string result = hashTable.search(searchName);
    if (result != "Not Found") {
        cout << "Phone number of " << searchName << ": " << result <<endl;
    } else {
        cout << searchName << " not found in the telephone book.\n";
    }
    cout<<"\n";
    cout<<"\nHash Table element are as follows...\n";
    hashTable.display();

    return 0;
}

Output:


Explanation of Output:

As the output above shows, some inputs are stored in the same bucket. For example, three inputs are stored in bucket 4. This happens because of the hash function we used: it returns the same index for all of those inputs, so they are stored in the same bucket.

Let's see how the hash function works:

S = 83 + o = 111 + n = 110 + a = 97 + l = 108 + i = 105    Total = 614

Taking the remainder of the sum of the ASCII values of this name gives 614 % 10 = 4.

N = 78 + i = 105 + t = 116 + i = 105 + n = 110    Total = 514

Taking the remainder of the sum of the ASCII values of this name gives 514 % 10 = 4.
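The arithmetic above can be checked with a short Python sketch. It mirrors the C++ hashFunction (sum of character codes modulo the table size); the Python function name hash_function is only an illustrative stand-in.

```python
TABLE_SIZE = 10


def hash_function(key: str) -> int:
    """Sum the character codes of the key and take the remainder."""
    return sum(ord(c) for c in key) % TABLE_SIZE


for name in ("Sonali", "Nitin"):
    total = sum(ord(c) for c in name)
    print(name, "sum =", total, "bucket =", hash_function(name))
```

Both names land in bucket 4, which is why they collide.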


Code 2:

2. A C++ implementation of a hash table using open addressing with linear probing for collision resolution. This implementation includes insert, search, and display operations.


#include <iostream>
#include <string>
using namespace std;

#define TABLE_SIZE 10
#define EMPTY "EMPTY"  // Placeholder for empty slots
#define DELETED "DELETED" // Placeholder for deleted slots

class HashTable {
private:
    string names[TABLE_SIZE];  // Array for names
    string phones[TABLE_SIZE]; // Array for phone numbers
    bool occupied[TABLE_SIZE]; // Tracks occupied slots

    // Hash function
    int hashFunction(const string& key) {
        int hash = 0;
        for (char c : key) {
            hash += c;
        }
        return hash % TABLE_SIZE;
    }

public:
    // Constructor
    HashTable() {
        for (int i = 0; i < TABLE_SIZE; ++i) {
            names[i] = EMPTY;
            phones[i] = EMPTY;
            occupied[i] = false;
        }
    }

    // Insert function
    void insert(const string& name, const string& phone) {
        int index = hashFunction(name);
        int start = index;
        while (names[index] != EMPTY && names[index] != DELETED) {
            index = (index + 1) % TABLE_SIZE;
            if (index == start) { // Table is full
                cout << "Error: Hash table is full. Cannot insert " << name << ".\n";
                return;
            }
        }
        names[index] = name;
        phones[index] = phone;
        occupied[index] = true;
    }

    // Search function
    string search(const string& name) {
        int index = hashFunction(name);
        int start = index;
        int comparison = 0;
        while (names[index] != EMPTY) {
            comparison++;
            if (names[index] == name) {
                cout << "\nTotal comparisons taken to search: " << comparison << endl;
                return phones[index];
            }
            index = (index + 1) % TABLE_SIZE;
            if (index == start) { // Avoid infinite loops
                break;
            }
        }
        cout << "\nTotal comparisons taken to search: " << comparison << endl;
        return "Not Found";
    }

    // Display function
    void display() {
        cout << "Hash Table Contents:\n";
        for (int i = 0; i < TABLE_SIZE; ++i) {
            if (names[i] == EMPTY || names[i] == DELETED) {
                cout << "Bucket " << i << ": [EMPTY]\n";
            } else {
                cout << "Bucket " << i << ": [" << names[i] << ": " << phones[i] << "]\n";
            }
        }
    }
};

int main() {
    HashTable hashTable;

    int n;
    cout << "Enter the number of clients: ";
    cin >> n;

    // Insert data into the hash table
    for (int i = 0; i < n; ++i) {
        string name, phone;
        cout << "Enter name of client " << i + 1 << ": ";
        cin >> name;
        cout << "Enter phone number of client " << i + 1 << ": ";
        cin >> phone;
        hashTable.insert(name, phone);
    }

    // Display the hash table
    hashTable.display();

    // Search for a key
    string searchName;
    cout << "Enter the name to search for: ";
    cin >> searchName;
    string result = hashTable.search(searchName);
    if (result != "Not Found") {
        cout << "Phone number of " << searchName << ": " << result << endl;
    } else {
        cout << searchName << " not found in the telephone book.\n";
    }

    return 0;
}

Output:



Question 1: When to use chaining and when to use open addressing Method?

Ans: Use chaining when there is a lot of data, that is, when we expect a high load factor. Chaining also fits when memory is not a constraint (for example, there is no fixed limit such as 550 MB) and when insertion and deletion operations are frequent.

Use open addressing when insertion and deletion operations are not frequent, when memory is constrained, and when the dataset is smaller, that is, when we can keep the load factor low.




Note: For chaining we have used a fixed-size array for each bucket; you can use a linked list as the data structure instead.
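To compare the two techniques by number of comparisons, here is a minimal Python sketch of both tables. It is not a translation of the C++ programs above: each chain is a growable Python list standing in for a linked list, the class names are invented for illustration, and the three clients with their phone numbers are hypothetical.

```python
TABLE_SIZE = 10


def hash_function(key):
    return sum(ord(c) for c in key) % TABLE_SIZE


class ChainingTable:
    def __init__(self):
        # One growable chain per bucket (stand-in for a linked list)
        self.buckets = [[] for _ in range(TABLE_SIZE)]

    def insert(self, name, phone):
        self.buckets[hash_function(name)].append((name, phone))

    def search(self, name):
        comparisons = 0
        for stored_name, phone in self.buckets[hash_function(name)]:
            comparisons += 1
            if stored_name == name:
                return phone, comparisons
        return None, comparisons


class LinearProbingTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE

    def insert(self, name, phone):
        index = hash_function(name)
        for step in range(TABLE_SIZE):
            probe = (index + step) % TABLE_SIZE
            if self.slots[probe] is None:
                self.slots[probe] = (name, phone)
                return True
        return False  # table is full

    def search(self, name):
        index = hash_function(name)
        comparisons = 0
        for step in range(TABLE_SIZE):
            entry = self.slots[(index + step) % TABLE_SIZE]
            if entry is None:  # an empty slot ends the probe sequence
                break
            comparisons += 1
            if entry[0] == name:
                return entry[1], comparisons
        return None, comparisons


# Hypothetical clients: Sonali and Nitin hash to 4, Amit hashes to 5
clients = [("Sonali", "111"), ("Nitin", "222"), ("Amit", "333")]
chaining, probing = ChainingTable(), LinearProbingTable()
for name, phone in clients:
    chaining.insert(name, phone)
    probing.insert(name, phone)

chain_total = sum(chaining.search(name)[1] for name, _ in clients)
probe_total = sum(probing.search(name)[1] for name, _ in clients)
print("Chaining comparisons:", chain_total)
print("Linear probing comparisons:", probe_total)
```

In this small run chaining needs 4 comparisons and linear probing needs 5, because Nitin spills into slot 5 and pushes Amit one slot further along.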


Basics and need of Data Science and Big Data, Applications of Data Science

 




























Sunday, 1 December 2024

FPP Assignment No 14 and 15

Problem Statement:
Develop a program to create a DataFrame from a NumPy array with custom column names.


import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Define custom column names
columns = ['Column A', 'Column B', 'Column C']

# Create a DataFrame
df = pd.DataFrame(data, columns=columns)
print("DataFrame created from NumPy array:")
print(df)

2. Drawing a Bar Plot and Scatter Plot using Matplotlib


import matplotlib.pyplot as plt

# Data for plots
categories = ['A', 'B', 'C', 'D']
values = [4, 7, 1, 8]
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# Bar Plot
plt.figure(figsize=(8, 4))
plt.bar(categories, values, color='skyblue')
plt.title("Bar Plot")
plt.xlabel("Categories")
plt.ylabel("Values")
plt.show()

# Scatter Plot
plt.figure(figsize=(8, 4))
plt.scatter(x, y, color='red', label='Points')
plt.title("Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()

Explanation:

  1. DataFrame Creation:

    • The program uses np.array to create a data matrix.
    • Custom column names are passed to pd.DataFrame to create the DataFrame.
  2. Bar Plot:

    • A bar plot is drawn using plt.bar, with labels for categories and values.
  3. Scatter Plot:

    • A scatter plot is created using plt.scatter, with x and y as input points.
