# Customizing Multiple Extraction Schemas
While a single schema is sufficient for many extraction tasks, there are scenarios where multiple schemas provide better organization and a more accurate representation of the data. This guide walks you through creating and managing multiple schemas in Extralit.
## Why Use Multiple Schemas?
Multiple schemas are beneficial when:
- Different parts of a paper contain distinct types of information.
- There are one-to-many relationships between data points.
- You want to establish relational links between different types of data.
- You need to prevent data duplication and maintain data integrity.
## Example 1: Separating Study Design and Demographic Information
Let's consider a scenario where a scientific paper presents information about two studies, each with multiple demographic groups. If we try to capture all this information in a single schema, we might end up with redundant data entry.
Here's an example of what the data in the paper might look like:
| Year | Study Type | Age Group | Gender | Count |
|---|---|---|---|---|
| 2020 | RCT | Child | Male | 50 |
| 2020 | RCT | Child | Female | 55 |
| 2020 | RCT | Adult | Male | 100 |
| 2020 | RCT | Adult | Female | 95 |
| 2021 | Observational | Adult | Male | 75 |
| 2021 | Observational | Adult | Female | 80 |
| 2021 | Observational | Senior | Male | 40 |
| 2021 | Observational | Senior | Female | 45 |
If we were to use a single schema to capture this information, it might look like this:
```python
import pandera as pa
from pandera.typing import Series

class StudyDemographic(pa.DataFrameModel):
    year: Series[int] = pa.Field(ge=2000, le=2024)
    study_type: Series[str] = pa.Field(isin=['RCT', 'Observational', 'Meta-analysis'])
    age_group: Series[str] = pa.Field(isin=['Child', 'Adult', 'Senior'])
    gender: Series[str] = pa.Field(isin=['Male', 'Female', 'Other'])
    count: Series[int] = pa.Field(gt=0)
```
However, using this schema would require redundant manual data entry. Notice how we would have to repeat the study information (`year` and `study_type`) for each demographic entry. This redundancy can lead to data inconsistencies that are exacerbated as the number of demographic groups grows, and it makes the data more difficult to update or correct.
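To make the redundancy concrete, here's a minimal sketch (the sample values are illustrative, not from a real paper): correcting a single study-level fact means rewriting every demographic row that repeats it.

```python
import pandas as pd

# Flat table mixing study-level and demographic-level fields.
flat = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020],
    "study_type": ["RCT", "RCT", "RCT", "RCT"],
    "age_group": ["Child", "Child", "Adult", "Adult"],
    "gender": ["Male", "Female", "Male", "Female"],
    "count": [50, 55, 100, 95],
})

# Suppose the 2020 study was mislabeled: the correction must touch
# every row belonging to that study, not just one record.
rows_to_fix = int((flat["year"] == 2020).sum())
flat.loc[flat["year"] == 2020, "study_type"] = "Observational"
```

With a separate study table, the same correction would be a single-row edit.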
To solve this, we can separate our schema into two:
```python
import pandera as pa
from pandera.typing import Series, Index

class StudyDesign(pa.DataFrameModel):
    StudyDesign_ID: Index[str] = pa.Field(unique=True)
    year: Series[int] = pa.Field(ge=2000, le=2024)
    sample_size: Series[int] = pa.Field(gt=0)
    study_type: Series[str] = pa.Field(isin=['RCT', 'Observational', 'Meta-analysis'])

class Demographic(pa.DataFrameModel):
    StudyDesign_ID: Series[str]  # This will be our foreign key
    age_group: Series[str] = pa.Field(isin=['Child', 'Adult', 'Senior'])
    gender: Series[str] = pa.Field(isin=['Male', 'Female', 'Other'])
    count: Series[int] = pa.Field(gt=0)
```
Note that we've introduced a `StudyDesign_ID` field in both the `StudyDesign` and `Demographic` schemas, which serves as a foreign key linking `Demographic` data to the `StudyDesign` information.
Now, we can represent our data more efficiently:
**Study Design Table**

| StudyDesign_ID | year | study_type |
|---|---|---|
| S01 | 2020 | RCT |
| S02 | 2021 | Observational |
**Demographic Table**

| StudyDesign_ID | age_group | gender | count |
|---|---|---|---|
| S01 | Child | Male | 50 |
| S01 | Child | Female | 55 |
| S01 | Adult | Male | 100 |
| S01 | Adult | Female | 95 |
| S02 | Adult | Male | 75 |
| S02 | Adult | Female | 80 |
| S02 | Senior | Male | 40 |
| S02 | Senior | Female | 45 |
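When you do need the denormalized view for analysis, the two tables can be joined back together on the foreign key. A minimal sketch with plain pandas (sample rows taken from the tables above):

```python
import pandas as pd

studies = pd.DataFrame({
    "StudyDesign_ID": ["S01", "S02"],
    "year": [2020, 2021],
    "study_type": ["RCT", "Observational"],
})

demographics = pd.DataFrame({
    "StudyDesign_ID": ["S01", "S01", "S02", "S02"],
    "age_group": ["Child", "Child", "Adult", "Senior"],
    "gender": ["Male", "Female", "Male", "Female"],
    "count": [50, 55, 75, 40],
})

# Attach the study-level columns to each demographic row via the foreign key.
combined = demographics.merge(studies, on="StudyDesign_ID", how="left")
```

Each study's `year` and `study_type` are stored once but appear on every matching demographic row in `combined`.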
This approach eliminates redundancy in the study information and allows for a more flexible representation of the data. It's particularly useful when:
- A single study can have a large number of demographic groups.
- You want to update study information without affecting demographic data.
- You need to analyze demographic data across multiple studies easily.
In the next sections, we'll explore how to establish relationships between these schemas and how to manage them in Extralit.
## Example 2: Establishing Relational Schemas
Let's extend our example to include a third schema for outcome measures:
```python
class OutcomeMeasure(pa.DataFrameModel):
    measure_id: Index[str] = pa.Field(unique=True)
    study_id: Series[str]  # Foreign key to StudyDesign
    demographic_id: Series[str]  # Foreign key to Demographic
    measure_type: Series[str] = pa.Field(isin=['Primary', 'Secondary'])
    value: Series[float] = pa.Field(ge=0)
```
In this schema:

- `study_id` links the outcome to a specific study.
- `demographic_id` optionally links the outcome to a specific demographic group.
This structure allows for complex querying across all three schemas, enabling analysis of outcomes by study and demographic characteristics.
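As a sketch of such a query, here's a plain-pandas join of outcomes to studies, aggregating outcome values by study type (the sample rows and values are illustrative):

```python
import pandas as pd

outcomes = pd.DataFrame({
    "measure_id": ["M01", "M02", "M03"],
    "study_id": ["S01", "S01", "S02"],
    "demographic_id": ["D01", "D02", "D03"],
    "measure_type": ["Primary", "Secondary", "Primary"],
    "value": [0.82, 0.41, 0.67],
})

studies = pd.DataFrame({
    "StudyDesign_ID": ["S01", "S02"],
    "study_type": ["RCT", "Observational"],
})

# Join outcomes to their study, then average outcome values per study type.
per_study_type = (
    outcomes.merge(studies, left_on="study_id", right_on="StudyDesign_ID")
    .groupby("study_type")["value"]
    .mean()
)
```

The same pattern extends to a three-way join by also merging on `demographic_id` against the `Demographic` table.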
## Converting Schemas to JSON
To use these schemas with Extralit's server, we need to convert them to JSON format. Here's how you can do that:
```python
from os.path import join

target_dir = 'path/to/schemas/'

StudyDesign.to_schema().to_json(join(target_dir, 'study_design_schema.json'))
Demographic.to_schema().to_json(join(target_dir, 'demographic_schema.json'))
OutcomeMeasure.to_schema().to_json(join(target_dir, 'outcome_measure_schema.json'))
```
This code will create three JSON files containing the schema definitions.
## Uploading Schemas to Extralit Server
Once you have your schema JSON files, you can upload them to your Extralit workspace using the command-line interface.

Replace `{WORKSPACE_NAME}` with the name of your Extralit workspace, and ensure the path to your schema JSON files is correct.
## Best Practices for Multiple Schemas
1. **Keep It Simple**: Start with the simplest schema structure that accurately represents your data. You can always add complexity later.

2. **Use Meaningful Names**: Choose clear, descriptive names for your schemas and fields.

3. **Establish Clear Relationships**: When using multiple schemas, clearly define how they relate to each other (e.g., through foreign keys).

4. **Avoid Redundancy**: Don't duplicate information across schemas unnecessarily. Use references (foreign keys) instead.

5. **Consider Extraction Efficiency**: Design your schemas to align with how information is typically presented in the papers you're analyzing. This can make the extraction process more straightforward.

6. **Validate Relationships**: Implement cross-schema validation to ensure referential integrity (e.g., every `StudyDesign_ID` in `Demographic` exists in `StudyDesign`).

7. **Document Your Schema Structure**: Maintain clear documentation of how your schemas relate to each other and what each schema represents.
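The referential-integrity check from the best practices above can be sketched with plain pandas (the sample rows, including the deliberately orphaned `"S03"`, are illustrative):

```python
import pandas as pd

studies = pd.DataFrame({"StudyDesign_ID": ["S01", "S02"]})

demographics = pd.DataFrame({
    "StudyDesign_ID": ["S01", "S02", "S03"],  # "S03" has no parent study
    "count": [50, 75, 10],
})

# Flag demographic rows whose foreign key has no matching study row.
orphaned = demographics[
    ~demographics["StudyDesign_ID"].isin(studies["StudyDesign_ID"])
]
```

Running such a check after every extraction pass catches broken links before they propagate into downstream analysis.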
By thoughtfully designing and implementing multiple schemas, you can create a robust, flexible system for extracting and organizing complex information from scientific papers. This approach allows for more nuanced analysis and can significantly improve the quality and usability of your extracted data.