While building a recent hobby project, Clean Travel Club, I wanted to get some initial content loaded into the site without doing it all manually.

This was a great opportunity to play around with AI agents and their tooling.

Clean Travel Club rates restaurants, coffee shops, hotels, and more on a bunch of health-related attributes. Think: “Yelp, but healthy.”

You can filter for places like the healthiest restaurants in New York City, or seed-oil-free restaurants in San Francisco, or organic coffee shops in Austin with outdoor seating. (And soon, lots more — hotels with non-toxic cleaning products and blackout shades, the best saunas or massages, more city coverage, etc.)

My goal was to have the AI do a first pass on collecting all this information and put it on the site, and we’d then go through and verify it over time.

There are a bunch of ways of varying technical complexity to go about this sort of task, including simply interfacing with Perplexity/ChatGPT/Claude, or using low- or no-code tools. But this post takes a reasonably technical approach involving writing code to interface with the APIs.

One Agent is Not Enough

The first challenge: I wanted up-to-date information, which meant using an “online” LLM that has access to the internet (most LLM APIs only have access to the LLM’s internal knowledge base, not the live internet). Perplexity’s API was the obvious choice.

But Perplexity’s LLMs are missing a feature I also needed: “structured outputs.” Structured outputs let you ensure that the agent returns data in a specific, predefined format. This is useful for guaranteeing the output matches your database schema. If you don’t use structured outputs, you risk the LLM adding a new parameter, renaming a parameter, or ignoring a parameter entirely — and then inserting the data into your system could fail or be incomplete.

OpenAI offers structured outputs. But the OpenAI API models are not online.

The answer: a multi-agent system. I used CrewAI — which seems like the easiest tool for a proof-of-concept — to orchestrate the two models.

I’d ask the Perplexity agent to go out and get live information, then hand it off to the OpenAI agent who would structure it correctly before returning it to my program.

Here’s the basic sketch (again, simplified — Clean Travel Club does a bunch more than this) of the sequence:

  1. I input a list of places (e.g. a list of restaurants in NYC)
  2. One-by-one, my program runs through the places
  3. First it queries the Google Places API to get baseline information
  4. Then it kicks off a CrewAI “crew”
  5. That crew begins with a Perplexity prompt asking about all the specific desired attributes of the place (accessing the internet to get the most up-to-date information)
  6. Then the crew moves those results to the OpenAI agent to structure it (again with specific instructions)
  7. And finally it all gets inserted into the database

Ready to dive into some code examples?

Data models

I built Clean Travel Club in Python / Django (with the help of the truly excellent SaaS Pegasus boilerplate!), but you could do something similar with any language or framework.

First I had to set up my data models with the desired parameters. I have a Django model for Restaurant and then a Pydantic model for the data that goes into it (this is OpenAI’s preferred format). Here’s an abridged/simplified excerpt from models.py. (Note: I’m only showing simplified versions of the core, material logic to illustrate the playbook for using agents to get data. There’s a bunch more complexity around data/schema munging, specifying other parameters (city, other types of places, other attributes, attribute categories, etc.) and more — but in the interest of keeping this post and code snippets to a reasonable length, I’ve removed all that. If there’s enough interest, I’d clean up my fuller code and publish it.)

# apps/places/models.py

from pydantic import BaseModel as PydanticBaseModel, Field

# A Pydantic model that will be used inside the attribute to allow for Boolean values with a None option, and notes on that value. 
# I also have a similar RatingField in the real product for values that aren't boolean.
class BooleanField(PydanticBaseModel):
    value: bool | None = Field(description="Boolean value of the attribute")
    notes: str | None = Field(description="Notes about the attribute")

# A basic version of the restaurant data schema. 
# In the real application, this is several layers of nested Pydantic models and output data.
class RestaurantAttributes(PydanticBaseModel):
    seed_oil_free: BooleanField = Field(description="Whether the restaurant is seed oil free")
    grass_fed: BooleanField = Field(description="Whether the establishment has grass-fed meats")

# Additional metadata — both for the app to display and for the agents to ingest
# You'll see how the *_notes fields are used in a minute.
restaurant_attribute_mapping = {
    "seed_oil_free": {
        "display_name": "Seed oil free food", 
        "ai_structure_notes": "Does the establishment specifically make its food WITHOUT SEED OILS? Either True or False. You can also be unsure (None)."
    },
    "grass_fed": {
        "display_name": "Grass-fed meats", 
        "ai_research_notes": "This is specifically about GRASS-FED meats on the menu. Simply having meat in general is not enough.", 
        "ai_structure_notes": "Does the establishment have grass-fed meats? Either True or False. You can also be unsure (None)."
    }
}

# The core model for Restaurants (BaseModel here is the project's Django base model, not the Pydantic one)
class Restaurant(BaseModel):
    name = models.CharField(max_length=100)
    slug = models.SlugField(unique=True, null=True)
    google_place_id = models.CharField(max_length=255, blank=True)
    google_place_data = JSONField(
        default=dict, blank=True, null=True, 
        schema={"type": "object", "keys": {}}
    )

    # This is a data-munging function I define elsewhere that converts the Pydantic model to a JSONField. 
    # You could also use django-pydantic-field or other approaches, including using JSON the whole way through
    SCHEMA = pydantic_to_jsonfield_schema(RestaurantAttributes, restaurant_attribute_mapping)

    attributes = JSONField(schema=SCHEMA)

The agents

Now that we have our structure, let’s set up our agents. Here are the basic components of the key logic. (Again: I’m leaving lots of details out of this to keep the post simple and focused on the agents.)

First I get the data out of the input file:

# apps/places/management/commands/collect_data.py

class Command(BaseCommand):
    help = """
    Researches places using Google Places API, Perplexity, and OpenAI, and inserts them into the database.
    Usage: 
        Batch search: $ docker compose exec web python manage.py collect_data --file <path_to_file>
    Batch file format is one search_query per line.
    """
        
    def add_arguments(self, parser):
        parser.add_argument('--file', type=str, help='Path to file containing search queries (one per line)')

    def handle(self, *args, **options):
        file_path = options['file']
        with open(file_path, 'r') as f:
            queries = [line.strip() for line in f if line.strip()]
            for query in queries:
                place = self.run_search(query)
                if place:
                    self.stdout.write(self.style.SUCCESS(f'Successfully saved {place.name}'))
                else:
                    self.stdout.write(self.style.ERROR(f'Failed to process {query}'))
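One small detail in handle worth noting: the list comprehension drops blank lines and trims whitespace, so the batch file can be a loose list of names. A standalone check of that parsing (place names are just examples from later in this post):

```python
# Simulates reading the batch file: one search query per line,
# with blank lines and stray whitespace tolerated
raw = "Sweetgreen - 12th + University\n\n   The Well Kitchen & Table   \n"
queries = [line.strip() for line in raw.splitlines() if line.strip()]
print(queries)  # → ['Sweetgreen - 12th + University', 'The Well Kitchen & Table']
```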

Then I find the place in Google Places, and kick off the agent crew:

# apps/places/management/commands/collect_data.py:Command

# all these snippets are still in the Command class

def run_search(self, search_query):
    gmaps_api_key = getattr(settings, 'GOOGLE_MAPS_API_KEY', None)
    gmaps = googlemaps.Client(key=gmaps_api_key)
    google_place = gmaps.places(f"{search_query}, New York City")['results'][0]
    google_place_details = gmaps.place(google_place['place_id'])['result']

    crew_output = self.run_crew(schema=RestaurantAttributes, inputs={'place_name': search_query})

    place = self.save_place(search_query, google_place_details, crew_output)

    return place

Now, what you’ll see below is how the crew runs.

The first function (generate_prompts) creates schema-specific prompts for the agents with clarifying information for them. These prompts take in the attributes we defined in models.py and create a prompt inquiring about them, including adding parenthetical notes from the restaurant_attribute_mapping if they exist.

The notes are useful for making sure the agents know clearly what they’re looking for, as well as ensuring the OpenAI structuring agent knows the data format you’re looking for (otherwise it sometimes guesses wrong, and the task fails when its output doesn’t match the designated structured format).

(An aside: I’ve found it really interesting to be working both in the deterministic world of “normal code” and the non-deterministic world of “LLM prompting” in the same project. The “normal code” usually either consistently works or consistently fails. But the prompting requires coaxing and tweaking to get it to reliably return the desired outcome. It’s fun and sometimes frustrating. My prompts here could probably be 10x better with some few-shot examples and other adjustments.)

# apps/places/management/commands/collect_data.py:Command

def generate_prompts(self, place_name, attribute_mapping):
    research_body = ''
    structure_body = ''

    for attribute, attribute_data in attribute_mapping.items():
        research_body += f'- it has {attribute_data["display_name"]}'
        structure_body += f'- it has {attribute_data["display_name"]}'
        if 'ai_research_notes' in attribute_data:
            research_body += f' ({attribute_data["ai_research_notes"]})'
        if 'ai_structure_notes' in attribute_data:
            structure_body += f' ({attribute_data["ai_structure_notes"]})'
        research_body += '\n'
        structure_body += '\n'

    research_prompt = f"""
    Research {place_name} to learn about how healthy it is and specifically about each of the following attributes. 
    Briefly include any relevant information for each:
    {research_body}
    - as well as how expensive it is — either low-priced ($), mid-priced ($$), high-priced ($$$), or very high-priced ($$$$)
    - as well as a brief general description of the establishment, specifically anything about its healthiness.
    """

    research_expected_output = f"""Bullet points about each of the following attributes:
    {research_body}
    - as well as how expensive it is — either low-priced ($), mid-priced ($$), high-priced ($$$), or very high-priced ($$$$)
    - as well as a brief general description of the establishment, specifically anything about its healthiness.
    """

    structure_prompt = f"""Take information about {place_name} and return it in the defined structured format. 
    If you are unsure, None is an acceptable answer for both the attribute value and the notes — don't hesitate to leave them as None if you aren't sure!
    For the notes, keep them brief and to the point, with a 'smart friend' tone. 
    These will be on a site where people are trying to find healthy establishments, so give context relevant to that and the specific attribute.
    Do NOT say "Yes" or "No" — we're looking for True or False or None.
    Do NOT put an attribute as True simply because it's "likely" due to similar places in that category having that attribute. You should just put None.
		
    We are looking for the following information:
    {structure_body}
    - Whether it is $
    - Whether it is $$
    - Whether it is $$$
    - Whether it is $$$$
    - And how you would describe it in a sentence or two for overall notes.
		
    Overall notes should be just 1 sentence, or 2 at most. And not too salesy, just a smart friend telling you about the place. 
    Do NOT put None in the Overall Notes, but you can put it in any other field.

    Also, if you include the place's name anywhere, use a short version of the name, like "The Well" instead of "The Well Kitchen & Table", or "Sweetgreen" instead of "Sweetgreen - 12th + University."
    """

    structure_expected_output = f"""
    Each of the items below about the establishment {place_name}, and notes about each.

    If you are unsure on a specific item, put None as the answer for both the attribute value and the notes. Don't hesitate to leave it as None if you aren't sure! Include whether:

    {structure_body}
    - Whether it is $
    - Whether it is $$
    - Whether it is $$$
    - Whether it is $$$$
    - And how you would describe it in a sentence or two for overall notes.
    """

    return research_prompt, research_expected_output, structure_prompt, structure_expected_output

Here’s an example of the research prompt that gets generated (reconstructed from the code above, for a hypothetical place called Sweetgreen):

    Research Sweetgreen to learn about how healthy it is and specifically about each of the following attributes.
    Briefly include any relevant information for each:
    - it has Seed oil free food
    - it has Grass-fed meats (This is specifically about GRASS-FED meats on the menu. Simply having meat in general is not enough.)
    - as well as how expensive it is — either low-priced ($), mid-priced ($$), high-priced ($$$), or very high-priced ($$$$)
    - as well as a brief general description of the establishment, specifically anything about its healthiness.

The second function (run_crew) is the one we called from run_search earlier, and it references those generated prompts.

# apps/places/management/commands/collect_data.py:Command

def run_crew(self, schema, inputs):
    place_name = inputs['place_name']

    research_prompt, research_expected_output, structure_prompt, structure_expected_output = self.generate_prompts(place_name, restaurant_attribute_mapping)

    agents = []

    researcher = Agent(
        role=f"{place_name} Senior Data Researcher",
        goal=f"Discover and document requested information about {place_name}",
        backstory=f"You're a seasoned researcher with a knack for uncovering requested information. You are known for your ability to find the most relevant information requested and present it in a clear and concise manner.",
        verbose=True,
        llm=LLM(
            # I use model llama-3.1-sonar-large-128k-online but am experimenting with whether small could do the trick too (or if huge is better) 
            model=getattr(settings, 'PERPLEXITY_MODEL', None), 
            base_url="https://api.perplexity.ai/",
            api_key=getattr(settings, 'PERPLEXITY_API_KEY', None)
        )
    )
    agents.append(researcher)

    structurer = Agent(
        role=f"{place_name} Data Structuring Specialist",
        goal=f"Take information about {place_name} and return it in the defined structured format",
        backstory=f"You're a meticulous data structuring specialist with a keen eye for detail. You take information and return it in the defined structured format.",
        verbose=True,
        llm=LLM(
            # I use gpt-4o-mini
            model=getattr(settings, 'OPENAI_MODEL', None), 
            api_key=getattr(settings, 'OPENAI_API_KEY', None)
        )
    )
    agents.append(structurer)

    tasks = []

    research_task = Task(
        description=research_prompt,
        agent=researcher,
        expected_output=research_expected_output
    )
    tasks.append(research_task)

    structure_task = Task(
        description=structure_prompt,
        agent=structurer,
        expected_output=structure_expected_output,
        output_json=RestaurantAttributes
    )
    tasks.append(structure_task)

    crew = Crew(
        agents=agents,
        tasks=tasks,
        process=Process.sequential,
        verbose=True,
    )

    result = crew.kickoff(inputs=inputs)
    # kickoff's return value bundles everything; we want the structure task's JSON output
    task_output = structure_task.output.json_dict

    return task_output

That’s most of the core logic! First put together the prompts (lots of words, but not actually that much going on), then set up the agents and their tasks, kick them off, and get the results. I was shocked how easy this was the first time I ran it.

Finally, we wrap up with the save_place function we referenced earlier:

# apps/places/management/commands/collect_data.py:Command

def save_place(self, search_query, google_place_details, crew_output):
    place_name = google_place_details.get('name', search_query)
    
    # another data munging function that puts the crew output into the right format to be inserted
    attribute_data = self.format_attribute_data(Restaurant.SCHEMA, crew_output) 
    
    place = Restaurant.objects.create(
        name=place_name,
        slug=slugify(place_name),
        google_place_id=google_place_details.get('place_id', ''),
        google_place_data=google_place_details,
        attributes=attribute_data
    )

    return place
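Like pydantic_to_jsonfield_schema, the format_attribute_data helper is elided from the post. A hypothetical minimal version (names and shapes are my guesses) might key the crew’s JSON output by the schema’s attributes, so any extra fields the agents invent get dropped and any missing ones default to None:

```python
def format_attribute_data(schema, crew_output):
    # Hypothetical: keep only schema-known attributes from the crew's output
    formatted = {}
    for attribute in schema["keys"]:
        data = crew_output.get(attribute) or {}
        formatted[attribute] = {
            "value": data.get("value"),
            "notes": data.get("notes"),
        }
    return formatted

SCHEMA = {"type": "object", "keys": {"seed_oil_free": {}, "grass_fed": {}}}
crew_output = {
    "seed_oil_free": {"value": True, "notes": "Cooks with olive oil"},
    "hallucinated_extra": {"value": True, "notes": "Agents sometimes add fields"},
}
attributes = format_attribute_data(SCHEMA, crew_output)
print(attributes["seed_oil_free"]["value"])   # True
print(attributes["grass_fed"])                # {'value': None, 'notes': None}
print("hallucinated_extra" in attributes)     # False
```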

And with that: it’s all done.

So again, when I run this, it takes my list of search queries, finds them on Google Places to get basic metadata, then runs to Perplexity to get up-to-date information about the requested attributes, then goes to OpenAI to structure it all, and then dumps it all right in the database.

Closing thoughts

When I first ran this, I found it pretty magical. I can simply input a list of places, and for a few cents and a couple minutes of waiting, I can get back a whole load of specified data about them, in exactly the format I need, and insert it straight into the site.

Next up on the data front, aside from improving Clean Travel Club overall, I’m thinking about combining all this with other data sources (like I already do with the Google Maps data), or even using AI agents (phone or email) to reach out directly to places to verify the information and get even more that’s not on the internet.

The possibilities for this are, obviously, endless. There are so many useful directory-style sites you could make. There are B2B products you could make. There are personal tools you could make.

It used to be that in order to pull together a big load of structured data, you either needed to find an already-structured source of the data and buy or scrape it, or you needed to write massively complex scrapers, or you needed to manually extract and input it (I’ve done all three, and none are particularly fun).

Now, with just a couple hundred lines of code, you can generate it all, as long as that data exists on the internet.

Obviously, the LLMs themselves are indistinguishable-from-magic technology. But I’ve been equally blown away by the quality of open-source tooling around them. Whether it’s CrewAI orchestrating the agents, or LiteLLM handling connections to them behind the scenes, it’s remarkable that tools exist with such good ergonomics so early in this technology wave.

The speed of development in the AI space is ridiculous right now — I’m looking forward to this post being horribly out-of-date in a month’s time. Onwards.

