.. _filtering_data:

==============
Filtering Data
==============

Filtering vector data is a common step in geospatial analysis: it narrows a dataset to the features that meet specific conditions. Kalpa filters rows with conditions written as Python expressions, in a Pandas-like style. This chapter shows how to filter vector data using conditions on one or more columns, so that you can build datasets tailored to your analysis.

Vector Filtering
----------------

The ``VectorLayerFiltering`` function filters a vector layer, or a file containing vector data (``.gpkg``/``.shp``), using custom conditions written as Python expressions.

Function Overview
~~~~~~~~~~~~~~~~~

The ``VectorLayerFiltering`` function applies a condition to the rows of a vector dataset and returns the rows that satisfy it.

**Key Arguments:**

- The input: either a selected layer or a file path to a vector dataset (e.g., ``.gpkg``, ``.shp``).
- ``filter_condition``: A Python condition for filtering rows. The condition uses the format:

  ``row['column_name'] <operator> value``

  where ``<operator>`` is a comparison such as ``>``, ``<``, or ``==``.

You can combine multiple conditions using logical operators like ``and``, ``or``, and ``not``.

Examples
--------

**Single-Column Based Filtering**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter all rows where the ``Population`` column is greater than 1,000.
- **Condition String**: ``row['Population'] > 1000``

**Multi-Column Based Filtering**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the ``Population`` is greater than 1,000 and the ``City`` starts with the letter 'C'.
- **Condition String**: ``row['Population'] > 1000 and row['City'].startswith('C')``

**Filtering Using Equality Conditions**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the ``City`` column is equal to 'B'.
- **Condition String**: ``row['City'] == 'B'``

**Combining Conditions with OR**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the ``Population`` is less than 1,000 or the ``City`` starts with 'A'.
- **Condition String**: ``row['Population'] < 1000 or row['City'].startswith('A')``

**Filtering by Distance (Geospatial Attributes)**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the distance to a fault line is less than 5 km (assuming the ``Fault_Dist`` column stores distances in kilometers).
- **Condition String**: ``row['Fault_Dist'] < 5``

**Filtering Rows with Numerical Ranges**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where ``Population`` is between 500 and 1,500, inclusive.
- **Condition String**: ``500 <= row['Population'] <= 1500``

**Filtering Rows Based on String Patterns**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the ``City`` name contains the letter 'a' (case-insensitive).
- **Condition String**: ``'a' in row['City'].lower()``

The other conditions in this chapter treat ``row['City']`` as a plain Python string (for example, ``row['City'].startswith('C')``), so this example uses the string method ``lower()`` rather than the column-wise ``.str`` accessor, which a single string value does not have.

**Filtering Rows with Missing or Null Values**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the ``geometry`` column is ``None`` (missing).
- **Condition String**: ``row['geometry'] is None``

**Filtering Rows Based on Multiple Conditions (Advanced)**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where ``Population`` is greater than 1,000 and the ``City`` does not start with 'A'.
- **Condition String**: ``row['Population'] > 1000 and not row['City'].startswith('A')``

**Filtering Rows Using Custom Functions**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Call a function inside the condition (here the built-in ``len``) to filter rows where the ``City`` name is longer than one character.
- **Condition String**: ``len(row['City']) > 1``

**Filtering Geospatial Data by Attribute and Proximity**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where faults are older than 50 million years and within 10 km of the grid points (assuming ``Fault_Age`` is stored in millions of years and ``Fault_Dist`` in kilometers).
- **Condition String**: ``row['Fault_Age'] > 50 and row['Fault_Dist'] < 10``

**Filtering Using Logical OR Conditions**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Scenario**: Filter rows where the ``City`` is either 'A' or 'C'. The membership test ``in`` is a compact equivalent of ``row['City'] == 'A' or row['City'] == 'C'``.
- **Condition String**: ``row['City'] in ['A', 'C']``

**Filtering by Area or Length Attributes**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For vector datasets with polygons or lines, you can filter by geometric properties such as area or length. These values are expressed in the units of the layer's coordinate reference system, so the examples below assume a projected CRS measured in meters.

- **Scenario**: Filter polygons where the area is greater than 1,000 square meters.
- **Condition String**: ``row['geometry'].area > 1000``

- **Scenario**: Filter line features where the length is less than 500 meters.
- **Condition String**: ``row['geometry'].length < 500``

Tips for Writing Filtering Conditions
-------------------------------------

1. **Use Logical Operators**: Combine conditions with ``and``, ``or``, or ``not`` to create complex queries.
2. **Check Data Types**: Ensure your column data types match the condition. Numeric values should not be compared to strings.
3. **Handle Missing Values**: Use a check such as ``row['column'] is not None`` to filter out rows with missing data. Numeric columns may store missing values as ``NaN`` rather than ``None``, so check how your dataset represents them.
4. **Validate Columns**: Ensure the columns used in filtering exist in the dataset.
5. **Optimize Conditions**: Use simple, efficient conditions to avoid unnecessary computational overhead.
6. **Test Conditions**: Before applying complex filters, test them on a small subset of data to ensure correctness.
7. **Export Results**: Save filtered datasets for further analysis or visualization.
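
One way to follow the "Test Conditions" tip is to try a condition string on a few hand-written rows before running it on a real layer. The sketch below is a hypothetical stand-in and not Kalpa's implementation: the ``filter_rows`` and ``cities`` helpers and the sample data are invented for illustration. It assumes that a condition is evaluated once per row, with the row available under the name ``row``, and that rows for which the condition is true are kept.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Names visible to a condition. This whitelist is an assumption of the sketch,
# not a description of what Kalpa exposes.
ALLOWED_NAMES = {"__builtins__": {}, "len": len}


def filter_rows(rows: List[Row], condition: str) -> List[Row]:
    """Return the rows for which `condition` evaluates to a true value."""
    # Compile the expression once, then evaluate it against each row in turn.
    code = compile(condition, "<filter_condition>", "eval")
    return [row for row in rows if eval(code, ALLOWED_NAMES, {"row": row})]


def cities(rows: List[Row]) -> List[str]:
    """Helper for display: list the City value of each row."""
    return [row["City"] for row in rows]


# Invented sample data using the column names from the examples above.
sample = [
    {"City": "A", "Population": 800},
    {"City": "B", "Population": 1200},
    {"City": "C", "Population": 1500},
    {"City": "Cairo", "Population": 2000},
]

print(cities(filter_rows(sample, "row['Population'] > 1000")))
# ['B', 'C', 'Cairo']
print(cities(filter_rows(sample, "row['Population'] > 1000 and row['City'].startswith('C')")))
# ['C', 'Cairo']
print(cities(filter_rows(sample, "500 <= row['Population'] <= 1500")))
# ['A', 'B', 'C']
print(cities(filter_rows(sample, "row['City'] in ['A', 'C']")))
# ['A', 'C']
print(cities(filter_rows(sample, "len(row['City']) > 1")))
# ['Cairo']
```

In this sketch, a condition that names a column missing from the data raises ``KeyError``, which is a quick way to catch typos in column names. Because the helper relies on ``eval``, it should only be used with condition strings you wrote yourself; emptying ``__builtins__`` does not make it safe for untrusted input.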