Skip to content

Extractor

The Extractor module analyzes unstructured text and extracts structured data using AI-powered analysis. It handles everything from simple data extraction to complex informal language patterns, making it perfect for processing user-generated content, product descriptions, social media posts, and conversational text.

Think of it as your intelligent data parser that can understand context, handle typos, interpret slang, and even detect rhetorical devices like sarcasm and understatement.

Features

  • Structured data extraction - Extract typed data from unstructured text using predefined JSON schemas
  • Informal text analysis - Identifies and handles informal language patterns, including litotes, slang, and conversational expressions
  • Explanation tracking - Provides reasoning for each extracted data point, improving transparency and debugging
  • Schema validation - Supports complex validation rules including enums, ranges, and custom constraints
  • Multi-format support - Works with social media posts, product listings, reviews, chat messages, and more

Basic Usage

Extract structured data from text using predefined schemas:

ruby
# Simple person extraction
text = "John Doe is 25 years old and works as a software engineer"
schema = {
  name: { type: 'string', description: 'Full name of the person' },
  age: { type: 'integer', description: 'Age in years' },
  profession: { type: 'string', description: 'Job title or profession' }
}

result = ActiveGenie::Extractor.call(text, schema)
# => {
#      name: "John Doe",
#      name_explanation: "Found directly at the beginning of text",
#      age: 25,
#      age_explanation: "Explicitly stated as 25 years old",
#      profession: "software engineer",
#      profession_explanation: "Identified as job title after 'works as'"
#    }

Real-World Examples

E-commerce Product Extraction

Perfect for parsing product listings from various sources:

ruby
product = "Sony 65\" Class BRAVIA XR X95K 4K HDR Mini LED TV - $1999.99 (Save $500)"
schema = {
  brand: { type: 'string', description: 'Product brand' },
  display_size: { type: 'string', description: 'Screen size with units' },
  model: { type: 'string', description: 'Product model name' },
  display_type: { type: 'string', description: 'Type of display technology' },
  price: { type: 'number', minimum: 0, description: 'Current price' },
  discount: { type: 'number', minimum: 0, description: 'Discount amount if any' }
}

result = ActiveGenie::Extractor.with_explanation(product, schema)
# => {
#      brand: "Sony",
#      brand_explanation: "Brand name at the beginning",
#      display_size: "65\"",
#      display_size_explanation: "Size specification before 'Class'",
#      model: "BRAVIA XR X95K",
#      model_explanation: "Model name between Class and display type",
#      display_type: "4K HDR Mini LED TV",
#      display_type_explanation: "Complete display technology description",
#      price: 1999.99,
#      price_explanation: "Current price after dash in USD format",
#      discount: 500,
#      discount_explanation: "Savings amount in parentheses"
#    }

Social Media Analysis

Extract insights from social media posts and interactions:

ruby
post = "@alice liked @dave's photo and commented: 'Amazing sunset at the beach!'"
schema = {
  action: {
    type: 'string',
    enum: ["post", "comment", "like", "share", "retweet", "reply"],
    description: 'Primary social media action'
  },
  username: { type: 'string', description: 'Acting user' },
  target_user: { type: 'string', description: 'Target of the action' },
  content_type: { type: 'string', description: 'Type of content interacted with' },
  sentiment: {
    type: 'string',
    enum: ["positive", "negative", "neutral"],
    description: 'Overall sentiment of interaction'
  }
}

result = ActiveGenie::Extractor.call(post, schema)
# => {
#      action: "like",
#      action_explanation: "Primary action mentioned first",
#      username: "alice",
#      username_explanation: "User performing the action",
#      target_user: "dave",
#      target_user_explanation: "Owner of the content being liked",
#      content_type: "photo",
#      content_type_explanation: "Type specified as 'photo'",
#      sentiment: "positive",
#      sentiment_explanation: "Comment shows enthusiasm about sunset"
#    }

Informal Text Processing

The with_litote method extends basic extraction by analyzing rhetorical devices and informal language patterns commonly found in conversational text. This is particularly valuable for processing user-generated content where people use indirect expressions, understatement, and informal speech patterns.

Rhetorical Patterns Detected

  • Litotes ("not bad", "isn't terrible") - Understatement using double negatives
  • Affirmative expressions ("sure", "absolutely", "no problem") - Positive confirmations
  • Negative expressions ("nah", "not really", "kind of meh") - Soft rejections
  • Sarcasm indicators - Context-based sarcasm detection
  • Informal intensifiers ("super", "totally", "kinda") - Casual emphasis

Chat Message Analysis

Perfect for analyzing customer support chats, user feedback, and conversational interfaces:

ruby
chat = "The new update isn't terrible, but it's not exactly amazing either"
schema = {
  sentiment: {
    type: 'string',
    enum: ['positive', 'negative', 'neutral', 'mixed'],
    description: 'Overall sentiment toward the update'
  },
  satisfaction_level: {
    type: 'integer',
    minimum: 1,
    maximum: 5,
    description: 'Satisfaction rating from 1-5'
  },
  feedback_type: {
    type: 'string',
    enum: ['compliment', 'complaint', 'suggestion', 'neutral_observation'],
    description: 'Type of feedback being provided'
  }
}

result = ActiveGenie::Extractor.with_litote(chat, schema)
# => {
#      sentiment: "mixed",
#      sentiment_explanation: "Uses litote 'isn't terrible' showing mild approval, balanced by 'not exactly amazing'",
#      satisfaction_level: 3,
#      satisfaction_level_explanation: "Lukewarm response indicates moderate satisfaction",
#      feedback_type: "neutral_observation",
#      feedback_type_explanation: "Balanced assessment without strong positive or negative bias",
#      message_litote: true,
#      litote_rephrased: "The new update is okay, but it could be better"
#    }

Review Analysis

Extract meaningful insights from informal product reviews:

ruby
review = "This restaurant isn't bad at all! The service wasn't horrible either."
schema = {
  overall_rating: {
    type: 'integer',
    minimum: 1,
    maximum: 5,
    description: 'Overall restaurant rating'
  },
  service_rating: {
    type: 'integer',
    minimum: 1,
    maximum: 5,
    description: 'Service quality rating'
  },
  recommendation: {
    type: 'boolean',
    description: 'Would recommend to others'
  }
}

result = ActiveGenie::Extractor.with_litote(review, schema)
# => {
#      overall_rating: 4,
#      overall_rating_explanation: "Positive sentiment via litote 'isn't bad at all'",
#      service_rating: 3,
#      service_rating_explanation: "Moderate positive via 'wasn't horrible'",
#      recommendation: true,
#      recommendation_explanation: "Overall positive tone suggests recommendation",
#      message_litote: true,
#      litote_rephrased: "This restaurant is quite good! The service was decent too."
#    }

Performance Considerations

⚠️ Processing Impact: The with_litote method performs additional rhetorical analysis, which:

  • Requires 2-3x more processing time than basic extraction
  • May involve multiple AI model calls for complex text
  • Best suited for conversational text under 500 words
  • Consider background processing for high-volume applications

When to Use Litote Processing

✅ Ideal for:

  • Customer feedback and reviews
  • Chat message analysis
  • Social media sentiment analysis
  • Survey responses with informal language
  • User-generated content moderation

❌ Not recommended for:

  • Formal documents or technical texts
  • High-volume real-time processing
  • Simple data extraction without sentiment needs
  • Structured data like product catalogs

Tips & Best Practices

  • Be descriptive with your schemas. Include detailed descriptions for each field to help the AI understand what you're looking for. The more context you provide, the better the extraction quality.

  • Use validation constraints effectively. Leverage enum, minimum, maximum, and other JSON schema constraints to guide extraction and ensure data quality.

  • Handle missing data gracefully. Not all text will contain every field in your schema. The extractor will only return fields it can confidently identify.

  • Optimize schema complexity. While the extractor can handle complex nested schemas, simpler schemas generally produce more reliable results and faster processing.

  • Choose the right method for your use case:

    • Use .call() for straightforward data extraction from structured text
    • Use .with_explanation() when you need reasoning for debugging or transparency
    • Use .with_litote() for informal, conversational, or user-generated content
  • Test with real data. Use representative samples of your actual text data when designing schemas, as real-world text often contains unexpected variations.

  • Consider context windows. For very long texts, consider breaking them into smaller chunks as AI models have token limits that can affect extraction quality.

Advanced Schema Patterns

Multi-Type Fields

Handle fields that could have multiple valid types:

ruby
# Product price that might be a number or "Contact for pricing"
schema = {
  price: {
    oneOf: [
      { type: 'number', minimum: 0 },
      { type: 'string', enum: ['Contact for pricing', 'Price on request', 'TBD'] }
    ]
  }
}

Array Extraction

Extract lists and collections from text:

ruby
text = "The product is available in red, blue, and green colors"
schema = {
  colors: {
    type: 'array',
    items: { type: 'string' },
    description: 'Available color options'
  }
}
# => { colors: ["red", "blue", "green"] }

Conditional Fields

Create context-dependent extraction rules:

ruby
review = "Great hotel! The pool was amazing and the spa was relaxing. Breakfast could be better."
schema = {
  rating: { type: 'integer', minimum: 1, maximum: 5 },
  amenities_mentioned: {
    type: 'array',
    items: {
      type: 'object',
      properties: {
        name: { type: 'string' },
        sentiment: { type: 'string', enum: ['positive', 'negative', 'neutral'] }
      }
    }
  }
}

Interface

.call(text, data_to_extract, config = {})

The primary extraction method that analyzes text and returns structured data based on your schema.

Parameters

NameTypeDescriptionRequiredExample
textStringThe text to analyze and extract data fromYes"John Doe is 25 years old"
data_to_extractHashJSON schema defining the data structure to extractYes{ name: { type: 'string' } }
configHashAdditional extraction configuration optionsNo{ model: "gpt-4", provider_name: "openai" }

Returns

Hash containing extracted values matching the schema structure, with explanation fields when using with_explanation.

Configuration Options

ruby
config = {
  model: "gpt-4o",           # AI model to use
  provider_name: "openai",        # AI provider (openai, anthropic, google, deepseek)
  temperature: 0.1,          # Lower values for more consistent extraction
  max_tokens: 1000,          # Maximum response length
  timeout: 30                # Request timeout in seconds
}

.with_explanation(text, data_to_extract, config = {})

Identical to .call() but includes detailed explanations for each extracted field. Use this method when you need transparency in the extraction process or for debugging purposes.

ruby
# Returns additional *_explanation fields
result = ActiveGenie::Extractor.with_explanation(text, schema)
# => {
#      name: "John Doe",
#      name_explanation: "Found directly at the beginning of the text",
#      age: 25,
#      age_explanation: "Explicitly stated as '25 years old'"
#    }

.with_litote(text, data_to_extract, config = {})

Specialized extraction method that analyzes rhetorical devices and informal language patterns. Best for conversational text, reviews, and user-generated content.

Additional Returns

  • message_litote - Boolean indicating if litotes were detected
  • litote_rephrased - Positive rephrasing of the original text (when applicable)
ruby
result = ActiveGenie::Extractor.with_litote("The weather isn't bad", schema)
# => {
#      mood: "positive",
#      mood_explanation: "Positive sentiment expressed through litote",
#      message_litote: true,
#      litote_rephrased: "The weather is good"
#    }

Response Structure

All extractor methods return a Hash with the following pattern:

ruby
{
  # Extracted fields matching your schema
  field_name: extracted_value,
  
  # Explanation fields (with_explanation and with_litote only)
  field_name_explanation: "Reasoning for extraction",
  
  # Litote-specific fields (with_litote only)
  message_litote: true/false,
  litote_rephrased: "Positive rephrasing of original text"
}

Error Handling

The extractor gracefully handles various error conditions:

  • Missing fields: Fields not found in text are omitted from results
  • Type mismatches: Invalid data types are either converted or omitted
  • API failures: Network issues return partial results when possible
  • Schema validation: Invalid schemas raise descriptive errors
ruby
begin
  result = ActiveGenie::Extractor.call(text, schema)
rescue ActiveGenie::InvalidProviderError => e
  # Handle provider configuration issues
rescue ActiveGenie::ExtractionError => e
  # Handle extraction-specific errors
end

⚠️ Performance Considerations

  • Basic extraction (.call): ~1-3 seconds for typical text
  • With explanations: ~2-5 seconds due to additional reasoning
  • Litote processing: ~3-8 seconds for rhetorical analysis
  • Batch processing: Consider background jobs for high-volume extraction
  • Token limits: Large texts may be truncated; consider chunking for optimal results