
Decision Tree Algorithm


The ID3 (Iterative Dichotomiser 3) algorithm is a Decision Tree classification algorithm that selects attributes based on Information Gain (IG) to maximize data purity at each step.


Step 1: Understanding the Key Concepts

Before implementing ID3, let's understand the core concepts:

1. Entropy (H)

Entropy measures the impurity or uncertainty in the dataset:

  • If all samples belong to one class, entropy = 0 (pure dataset).
  • If the dataset is evenly split between two classes, entropy = 1 (maximum uncertainty for a binary target).

Formula for Entropy:

H(S) = -\sum_i p_i \log_2(p_i)

Where:

  • p_i = Probability of class i in dataset S
  • \log_2 = Logarithm base 2
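As a quick sanity check, the entropy formula can be sketched in a few lines of Python (a minimal standalone helper working from raw class counts, separate from the full program later in the post):

```python
from math import log2

def entropy(counts):
    """H(S) from a list of class counts, e.g. [yes_count, no_count]."""
    total = sum(counts)
    # The c > 0 guard skips empty classes, since 0 * log2(0) is taken as 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([7, 7]))   # evenly split between two classes: 1.0
print(entropy([9, 5]))   # the dataset used below: about 0.94
```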

2. Information Gain (IG)

Information Gain tells us how much entropy is reduced after splitting the data using an attribute.

Formula for Information Gain:

IG(S, A) = H(S) - H(S|A)

Where:

  • H(S) = Entropy before splitting
  • H(S|A) = Weighted entropy after splitting by attribute A

The attribute with the highest IG is chosen for splitting.
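A small numeric illustration (using a hypothetical binary attribute, not one from the dataset below, that splits 14 samples into one pure group and one mixed group):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Parent: 9 Yes / 5 No. A hypothetical attribute splits it into a pure
# group (4 Yes, 0 No) and a mixed group (5 Yes, 5 No).
h_parent = entropy([9, 5])
h_after = (4 / 14) * entropy([4, 0]) + (10 / 14) * entropy([5, 5])
print(round(h_parent - h_after, 3))  # IG = H(S) - H(S|A): 0.226
```

The pure group contributes nothing to the weighted entropy, so the gain comes entirely from shrinking the mixed part of the data.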


Step 2: Example Dataset

Let’s use an example dataset where we classify whether a person buys a computer based on Age, Income, Student Status, and Credit Rating.

| Age    | Income | Student | Credit Rating | Buys Computer? |
|--------|--------|---------|---------------|----------------|
| Youth  | High   | No      | Fair          | No             |
| Youth  | High   | No      | Excellent     | No             |
| Middle | High   | No      | Fair          | Yes            |
| Senior | Medium | No      | Fair          | Yes            |
| Senior | Low    | Yes     | Fair          | Yes            |
| Senior | Low    | Yes     | Excellent     | No             |
| Middle | Low    | Yes     | Excellent     | Yes            |
| Youth  | Medium | No      | Fair          | No             |
| Youth  | Low    | Yes     | Fair          | Yes            |
| Senior | Medium | Yes     | Fair          | Yes            |
| Youth  | Medium | Yes     | Excellent     | Yes            |
| Middle | Medium | No      | Excellent     | Yes            |
| Middle | High   | Yes     | Fair          | Yes            |
| Senior | Medium | No      | Excellent     | No             |

Here, "Buys Computer?" is the target variable (Yes/No).


Step 3: Calculate the Initial Entropy H(S)

First, calculate the entropy of the target variable.

  • Total samples = 14
  • Yes = 9, No = 5

Using the entropy formula:

H(S) = -\left(\frac{9}{14} \log_2 \frac{9}{14} + \frac{5}{14} \log_2 \frac{5}{14}\right) = 0.94

Step 4: Calculate Entropy for Each Attribute

Now, calculate entropy after splitting for each attribute.

Splitting by "Age"

We split the dataset into three groups: Youth, Middle, and Senior.

| Age    | Yes | No | Total | Entropy |
|--------|-----|----|-------|---------|
| Youth  | 2   | 3  | 5     | 0.971   |
| Middle | 4   | 0  | 4     | 0       |
| Senior | 3   | 2  | 5     | 0.971   |

Weighted Entropy:

H(S|Age) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693

Information Gain for Age:

IG(Age) = H(S) - H(S|Age) = 0.94 - 0.693 = 0.247

Similarly, we compute IG for Income, Student, and Credit Rating:

  • IG(Age) = 0.247
  • IG(Income) = 0.029
  • IG(Student) = 0.152
  • IG(Credit Rating) = 0.048

Since Age has the highest IG, it becomes the root node.
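All four gains can be reproduced from the (Yes, No) counts in the dataset table (a standalone check, separate from the full program later in the post). Note that Student comes out as 0.152 when intermediate entropies are kept at full precision; sources that round H(S) and the branch entropies first report 0.151.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, child_splits):
    """IG(S, A) = H(S) minus the weighted entropy of the child groups."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in child_splits)
    return entropy(parent) - weighted

# (Yes, No) counts per attribute value, read off the dataset table
splits = {
    'Age':           [[2, 3], [4, 0], [3, 2]],  # Youth, Middle, Senior
    'Income':        [[2, 2], [4, 2], [3, 1]],  # High, Medium, Low
    'Student':       [[6, 1], [3, 4]],          # Yes, No
    'Credit Rating': [[6, 2], [3, 3]],          # Fair, Excellent
}
for name, split in splits.items():
    print(name, round(information_gain([9, 5], split), 3))
# Age 0.247, Income 0.029, Student 0.152, Credit Rating 0.048
```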


Step 5: Choose the Best Attribute and Split the Data

Now that Age is the root, we repeat the process for each subset.


Step 6: Repeat Until the Tree is Complete

Step 6.1: Splitting the "Youth" Subset

| Age   | Income | Student | Credit Rating | Buys Computer? |
|-------|--------|---------|---------------|----------------|
| Youth | High   | No      | Fair          | No             |
| Youth | High   | No      | Excellent     | No             |
| Youth | Medium | No      | Fair          | No             |
| Youth | Low    | Yes     | Fair          | Yes            |
| Youth | Medium | Yes     | Excellent     | Yes            |

Entropy Calculation:

H(Youth) = 0.971

Best Split: "Student"

IG(Student) = 0.971

Splitting by Student produces two pure branches (Yes → Yes, No → No), so the gain equals the subset's entire entropy.

Now, we split Youth by Student.
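The Youth numbers can be verified in a few lines (a standalone check using the counts above):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Youth subset: 2 Yes / 3 No.  Student=Yes -> (2 Yes, 0 No),
# Student=No -> (0 Yes, 3 No): both branches are pure.
h_youth = entropy([2, 3])
h_after = (2 / 5) * entropy([2, 0]) + (3 / 5) * entropy([0, 3])
print(round(h_youth - h_after, 3))  # 0.971
```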


Step 6.2: Splitting the "Middle" Subset

| Age    | Income | Student | Credit Rating | Buys Computer? |
|--------|--------|---------|---------------|----------------|
| Middle | High   | No      | Fair          | Yes            |
| Middle | Low    | Yes     | Excellent     | Yes            |
| Middle | Medium | No      | Excellent     | Yes            |
| Middle | High   | Yes     | Fair          | Yes            |

Since all instances are Yes, entropy = 0, so no further splitting is needed.


Step 6.3: Splitting the "Senior" Subset

| Age    | Income | Student | Credit Rating | Buys Computer? |
|--------|--------|---------|---------------|----------------|
| Senior | Medium | No      | Fair          | Yes            |
| Senior | Low    | Yes     | Fair          | Yes            |
| Senior | Low    | Yes     | Excellent     | No             |
| Senior | Medium | No      | Excellent     | No             |
| Senior | Medium | Yes     | Fair          | Yes            |

Entropy Calculation:

H(Senior) = 0.971

Best Split: "Credit Rating"

IG(Credit Rating) = 0.971

Splitting by Credit Rating separates the Senior subset perfectly (Fair → Yes, Excellent → No), so it is chosen over Student, whose branches both remain mixed (IG ≈ 0.020).
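A standalone check of the two candidate splits on the Senior subset, using the counts from the table above:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, child_splits):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in child_splits)

# Senior subset: 3 Yes / 2 No
# Student:       Yes -> (2 Yes, 1 No), No        -> (1 Yes, 1 No)
# Credit Rating: Fair -> (3 Yes, 0 No), Excellent -> (0 Yes, 2 No)
ig_student = information_gain([3, 2], [[2, 1], [1, 1]])
ig_credit = information_gain([3, 2], [[3, 0], [0, 2]])
print(round(ig_student, 3), round(ig_credit, 3))  # 0.02 0.971
```

Credit Rating's gain equals the subset's full entropy because both of its branches are pure.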


Final Decision Tree

[Age]
 ├─ Youth  → [Student]
 │            ├─ No  → No
 │            └─ Yes → Yes
 ├─ Middle → Yes
 └─ Senior → [Credit Rating]
              ├─ Fair      → Yes
              └─ Excellent → No

  • Youth is split by Student.
  • Middle is pure (Yes).
  • Senior is split by Credit Rating.

Step 7: Classify New Data

If a new person has:

  • Age = Youth
  • Income = Medium
  • Student = Yes
  • Credit Rating = Fair

Following the tree:

  1. Age = Youth → Split by Student
  2. Student = Yes → Predict "Yes" (Buys Computer)

Thus, the final prediction is "Yes".

PROGRAM

```python
import numpy as np
import pandas as pd


class ID3DecisionTree:

    def __init__(self):
        self.tree = None  # Store the final decision tree

    def entropy(self, target_col):
        """Calculate entropy of a dataset."""
        values, counts = np.unique(target_col, return_counts=True)
        probabilities = counts / len(target_col)
        return -np.sum(probabilities * np.log2(probabilities))

    def information_gain(self, data, feature, target):
        """Calculate the Information Gain of a feature."""
        total_entropy = self.entropy(data[target])

        # Calculate entropy for each subset after splitting by feature
        values, counts = np.unique(data[feature], return_counts=True)
        weighted_entropy = np.sum([
            (counts[i] / np.sum(counts)) * self.entropy(data[data[feature] == values[i]][target])
            for i in range(len(values))
        ])

        return total_entropy - weighted_entropy  # Information Gain

    def best_feature_to_split(self, data, target):
        """Find the feature with the highest Information Gain."""
        features = data.columns.drop(target)
        gains = {feature: self.information_gain(data, feature, target) for feature in features}
        return max(gains, key=gains.get)  # Return the feature with the highest IG

    def build_tree(self, data, target):
        """Build the decision tree using the ID3 algorithm."""
        target_values = np.unique(data[target])

        # Base case: if all target values are the same, return that value
        if len(target_values) == 1:
            return target_values[0]

        # Base case: if no features are left, return the most common target value
        if len(data.columns) == 1:
            return data[target].mode()[0]

        # Choose the best feature for splitting
        best_feature = self.best_feature_to_split(data, target)
        tree = {best_feature: {}}

        # Split the data and recursively build subtrees
        for value in np.unique(data[best_feature]):
            subset = data[data[best_feature] == value].drop(columns=[best_feature])
            tree[best_feature][value] = self.build_tree(subset, target)

        return tree

    def fit(self, data, target):
        """Train the model by building the tree."""
        self.tree = self.build_tree(data, target)

    def predict(self, sample):
        """Classify a new sample using the trained decision tree."""
        def classify(tree, sample):
            if not isinstance(tree, dict):
                return tree  # Leaf node reached

            root = next(iter(tree))         # Current decision attribute
            value = sample.get(root, None)  # Sample's value for that attribute

            if value in tree[root]:         # Traverse the matching branch
                return classify(tree[root][value], sample)
            return "Unknown"                # Value not seen during training

        return classify(self.tree, sample)


# Sample dataset
data = pd.DataFrame({
    'Age': ['Youth', 'Youth', 'Middle', 'Senior', 'Senior', 'Senior', 'Middle',
            'Youth', 'Youth', 'Senior', 'Youth', 'Middle', 'Middle', 'Senior'],
    'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low',
               'Medium', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'Medium'],
    'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes',
                'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No'],
    'CreditRating': ['Fair', 'Excellent', 'Fair', 'Fair', 'Fair', 'Excellent', 'Excellent',
                     'Fair', 'Fair', 'Fair', 'Excellent', 'Excellent', 'Fair', 'Excellent'],
    'BuysComputer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                     'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
})

# Train the decision tree model
tree_model = ID3DecisionTree()
tree_model.fit(data, 'BuysComputer')

# Print the decision tree
print("Decision Tree:", tree_model.tree)

# Example: classify a new sample
new_sample = {'Age': 'Youth', 'Income': 'Medium', 'Student': 'Yes', 'CreditRating': 'Fair'}
prediction = tree_model.predict(new_sample)
print(f"Prediction for {new_sample}: {prediction}")
```


Comments

Popular posts from this blog

OPRATING SYSTEM

 https://drive.google.com/drive/folders/1VAYZuJLVeDBUd1UM2C4nvJTysqB3fMmt 📌 Module 1: Operating System Fundamentals 1️⃣ Introduction to Operating System & Evolution What is an Operating System (OS)? An Operating System (OS) is system software that acts as an interface between hardware and users , managing system resources efficiently. Key Functions of an OS ✅ Process Management – Controls execution of programs ✅ Memory Management – Allocates memory to processes ✅ File System Management – Manages files and directories ✅ Device Management – Controls hardware devices ✅ Security & Access Control – Protects data and system resources 2️⃣ Evolution of Operating Systems Era OS Type Example Characteristics 1950s Batch OS IBM Mainframes Executes jobs sequentially 1960s Multiprogramming OS UNIX Runs multiple processes simultaneously 1970s Time-Sharing OS MULTICS CPU switches between multiple users 1980s Personal Computing OS MS-DOS, Mac OS Single-user OS for PCs 199...

sql

  📌 Module 1: Introduction to SQL 🔹 What is SQL? 🔹 Database vs. DBMS vs. RDBMS 🔹 Types of Databases (SQL vs. NoSQL) 🔹 Popular SQL Databases (MySQL, PostgreSQL, SQLite, Oracle, MS SQL Server) 🔹 Installing MySQL/PostgreSQL & Setting Up Database ✅ Hands-on: ✔️ Install MySQL/PostgreSQL & Set Up a Test Database 📌 Module 2: SQL Basics - Data Retrieval 🔹 Understanding Database Tables & Schemas 🔹 SELECT Statement – Retrieving Data 🔹 Using WHERE Clause for Filtering 🔹 ORDER BY for Sorting Results 🔹 Using LIMIT & OFFSET for Pagination ✅ Hands-on: ✔️ Retrieve Employee Details from a Database 📌 Module 3: SQL Functions & Aggregation 🔹 Built-in Functions ( COUNT() , SUM() , AVG() , MIN() , MAX() ) 🔹 Using GROUP BY for Aggregation 🔹 HAVING Clause for Filtering Groups 🔹 Using DISTINCT for Unique Values ✅ Hands-on: ✔️ Find Total Salary, Average Salary, and Count of Employees per Department 📌 Module 4: SQL Joins & Relationsh...