.. _creating_data_geodata_eng_ai:

==========================================================================
Creating Data for Geodata Engineering and Artificial Intelligence Modeling
==========================================================================

Kalpa helps you prepare datasets for geodata engineering, geostatistics, and machine learning applications. By extracting information from loaded raster and vector layers, you can create structured, machine-trainable datasets tailored to your modeling needs. This chapter outlines how to filter, process, and extract that data.

Overview
--------
When you load multiple layers of raster and vector data, Kalpa allows you to extract their attributes for further analysis. This process involves creating a grid over a defined area of interest (AOI) and sampling raster values or vector attributes into a new dataset. The sampled data is saved as a vector file (``.gpkg``) and includes:

- **Raster Values**: Directly sampled at grid points.
- **Vector Attributes**: Selected columns from vector data, together with the distance from each grid point to the nearest vector geometry.

For example, if you select a vector column named ``Fault_Age``, the new layer will contain:

- A column named ``Fault_Age``, which stores the sampled attribute value.
- A column named ``Fault_Age_dist``, which stores the distance from the grid point to the nearest vector geometry.
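
The structure of one sampled record can be illustrated with a minimal Python sketch. It is not Kalpa's implementation: the function names, the ``elevation`` raster name, and the sample data are hypothetical. It also assumes that raster values are read from the cell containing the point (no interpolation), that the attribute value comes from the nearest feature, and that each feature is simplified to a single point.

.. code-block:: python

   import math


   def sample_raster(raster, origin_x, origin_y, cell_size, x, y):
       """Return the value of the raster cell containing (x, y), or None outside it.

       `raster` is a list of rows with row 0 at the top; (origin_x, origin_y)
       is the top-left corner. Cell lookup without interpolation is an assumption.
       """
       col = math.floor((x - origin_x) / cell_size)
       row = math.floor((origin_y - y) / cell_size)
       if 0 <= row < len(raster) and 0 <= col < len(raster[0]):
           return raster[row][col]
       return None


   def nearest_feature(features, x, y, column):
       """Return (attribute value, distance) for the feature nearest to (x, y).

       Simplification: each feature is a single point, not a line or polygon.
       """
       def distance(feature):
           return math.hypot(feature["x"] - x, feature["y"] - y)

       nearest = min(features, key=distance)
       return nearest[column], distance(nearest)


   def sample_point(x, y, raster_name, raster, origin, cell_size, features, column):
       """Build one output record for the grid point (x, y)."""
       value, dist = nearest_feature(features, x, y, column)
       return {
           "x": x,
           "y": y,
           raster_name: sample_raster(raster, origin[0], origin[1], cell_size, x, y),
           column: value,
           f"{column}_dist": dist,
       }


   # Hypothetical inputs: a 2 x 2 raster with 1-unit cells and two fault features.
   elevation = [[1.0, 2.0],
                [3.0, 4.0]]
   faults = [
       {"x": 1.5, "y": 3.5, "Fault_Age": "Jurassic"},
       {"x": 10.0, "y": 10.0, "Fault_Age": "Permian"},
   ]

   record = sample_point(1.5, 0.5, "elevation", elevation, (0.0, 2.0), 1.0,
                         faults, "Fault_Age")
   print(record)

For the grid point ``(1.5, 0.5)``, the record holds an ``elevation`` of ``4.0``, a ``Fault_Age`` of ``Jurassic``, and a ``Fault_Age_dist`` of ``3.0``.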

Grid Types for Data Sampling
----------------------------
Kalpa supports two primary gridding approaches to structure the sampling process:

1. **Random Grid**

   - **Description**: Randomly distributes points across the AOI.
   - **Applications**:

     - Suitable for creating unbiased training datasets for machine learning models.
     - Can reduce spatial autocorrelation between training samples, which may improve generalization.

   - **Benefits**:

     - Provides diverse sampling across the area.
     - Reduces overrepresentation of specific regions or patterns.

2. **Regular Grid**

   - **Description**: Creates a grid with uniform spacing based on a specified resolution.
   - **Applications**:

     - Ideal for geodata engineering tasks, such as image processing or geophysical filtering (e.g., upward and downward continuation).
     - Useful for spatial modeling and interpolation.

   - **Benefits**:

     - Ensures consistent coverage across the AOI.
     - Facilitates compatibility with raster-based algorithms.

For regular grids, the X and Y resolutions are identical, ensuring a uniform grid layout.
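
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kalpa's code: ``random_grid`` and ``regular_grid`` are hypothetical helpers, the AOI is reduced to a bounding box ``(min_x, min_y, max_x, max_y)``, and the regular grid is assumed to start at the lower-left corner of that box.

.. code-block:: python

   import math
   import random


   def random_grid(bbox, n_points, seed=None):
       """Return n_points uniformly distributed (x, y) points inside bbox."""
       min_x, min_y, max_x, max_y = bbox
       rng = random.Random(seed)  # a fixed seed makes the sample reproducible
       return [
           (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
           for _ in range(n_points)
       ]


   def regular_grid(bbox, resolution):
       """Return (x, y) points spaced `resolution` apart in both X and Y."""
       min_x, min_y, max_x, max_y = bbox
       # The small tolerance guards against floating-point round-off.
       n_cols = math.floor((max_x - min_x) / resolution + 1e-9) + 1
       n_rows = math.floor((max_y - min_y) / resolution + 1e-9) + 1
       return [
           (min_x + col * resolution, min_y + row * resolution)
           for row in range(n_rows)
           for col in range(n_cols)
       ]


   aoi = (0.0, 0.0, 100.0, 50.0)  # hypothetical bounding box
   random_points = random_grid(aoi, n_points=10, seed=42)
   regular_points = regular_grid(aoi, resolution=25.0)
   print(len(random_points), len(regular_points))

With this bounding box and a resolution of 25 units, the regular grid has 5 columns and 3 rows, so 15 points. A polygon AOI would additionally require discarding points that fall outside the polygon, which this sketch omits.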

Step-by-Step Guide: Sampling Data
---------------------------------

1. **Accessing the Sampling Tool**

   - Navigate to **Data Processing > Data Sampling** to open the data sampling interface.

2. **Defining the Area of Interest (AOI)**

   - Specify the region in which the data will be sampled. You can define the AOI in two ways:

     - **Bounding Box**:

       - Use the **Bounding Box Utility** to create a rectangular AOI based on an existing raster or vector layer.
       - Go to **Vector > Bounding Box**, select the layer, and save the bounding box layer.

     - **Vector File**:

       - Use a vector file with polygon or multipolygon geometries (including complex shapes) or point geometries as the AOI.

3. **Selecting Data Layers**

   - **Raster Layers**:

     - Select one or more loaded raster layers to sample values at grid points.

   - **Vector Layers**:

     - Choose one or more vector layers. A dropdown with checkboxes then appears, in which you select the columns (attributes) to sample.

4. **Choosing a Gridding Method**

   - Select one of the two available gridding methods:

     - **Random Grid**: Specify the number of points to generate.
     - **Regular Grid**: Set the resolution of the grid.

5. **Creating the Grid and Sampling Data**

   1. Click **Create** to generate the grid and begin sampling data.
   2. Wait for the process to complete. A progress bar or notification indicates the status.

6. **Saving the Sampled Data**

   - After the sampling process finishes, a save dialog appears. Specify the file name and save the output as a vector file (``.gpkg``).
   - The saved file is added as a new layer in the **Layer Window**.
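
Outside Kalpa, the saved file can be inspected with standard tools, because a GeoPackage is an SQLite database that lists its layers in the ``gpkg_contents`` table. The sketch below uses only Python's standard library. The helper name ``list_feature_tables`` is hypothetical, and the layer and column names in your file depend on what you selected during sampling.

.. code-block:: python

   import sqlite3


   def list_feature_tables(gpkg_path):
       """Return {layer name: [column names]} for the vector layers in a GeoPackage."""
       conn = sqlite3.connect(gpkg_path)
       try:
           # gpkg_contents registers every layer; vector layers have data_type 'features'.
           names = [
               row[0]
               for row in conn.execute(
                   "SELECT table_name FROM gpkg_contents WHERE data_type = 'features'"
               )
           ]
           # PRAGMA table_info yields one row per column; index 1 is the column name.
           return {
               name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
               for name in names
           }
       finally:
           conn.close()

Calling ``list_feature_tables("sampled_points.gpkg")`` (a hypothetical file name) returns the layer name mapped to its columns, which is a quick way to confirm that columns such as ``Fault_Age`` and ``Fault_Age_dist`` are present before training a model.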

Tips for Effective Data Sampling
--------------------------------

- For large datasets, use a smaller AOI to restrict the sampling region and save computational resources and time.
- When working with machine learning models, using a **Random Grid** can help reduce sampling bias and improve model performance.
- For spatially dense geodata engineering tasks, use a **Regular Grid** with a resolution that matches the scale of your analysis.
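
Before choosing a resolution for a regular grid, it can help to estimate how many points it will produce, since the count grows with the square of the refinement. The following sketch assumes a rectangular AOI and a grid that includes points on both edges of the extent. The helper ``regular_grid_point_count`` and the 10 km extent are hypothetical.

.. code-block:: python

   import math


   def regular_grid_point_count(bbox, resolution):
       """Estimate the number of points in a regular grid covering bbox."""
       min_x, min_y, max_x, max_y = bbox
       # The small tolerance guards against floating-point round-off.
       n_cols = math.floor((max_x - min_x) / resolution + 1e-9) + 1
       n_rows = math.floor((max_y - min_y) / resolution + 1e-9) + 1
       return n_cols * n_rows


   extent = (0.0, 0.0, 10_000.0, 10_000.0)  # hypothetical 10 km x 10 km AOI in metres
   for resolution in (100.0, 20.0, 10.0):
       print(resolution, regular_grid_point_count(extent, resolution))

For this extent, a 100 m resolution gives 10,201 points, while 10 m gives 1,002,001 points: a tenfold finer resolution yields roughly a hundred times more points to sample.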