Open source tool for analyzing your social media data (want to help me make it better?)

I classified 2,500 posts from Bluesky’s 10 most-followed accounts using an open-source LLM pipeline I built called cat-vader. The classified dataset is now public on my HF profile.

cat-vader is a fork of cat-llm, a package I originally built for classifying open-ended survey responses in academic research. It supports multi-label classification, automatic category discovery, and direct Threads/Bluesky API integration.

Some findings from the analysis:

  1. Account identity explains ~62% of engagement variance
  2. Political and social content outperforms within any given account
  3. Economy posts appear to tank engagement, but the effect disappears once you control for who’s posting

Full writeup: What Bluesky’s Most-Followed Accounts Actually Post About - Chris Soria
GitHub: GitHub - chrissoria/catvader · GitHub


Hmm… GitHub link is dead…?

oh whoops: GitHub - chrissoria/catvader · GitHub


Where cat-vader sits in the ecosystem (and why it’s interesting)

cat-vader is effectively a “fetch → label (multi-label) → analyze” pipeline with optional category discovery and provider-agnostic LLM backends (plus ensemble voting). Your writeup also highlights a very “social platforms” reality: Bluesky’s API reports views = 0, so engagement analysis ends up centered on likes/replies rather than reach. (Chris Soria)

That combination (platform ingestion + schema-driven multi-label LLM labeling + downstream analysis) has a lot of surface area to improve—and plenty of adjacent projects to borrow patterns from.


Similar projects/attempts worth studying (what to borrow)

1) LLM-assisted labeling + review workflows (human-in-the-loop)

These tools focus on the practical “label at scale, then audit/correct” loop—exactly what social media coding needs.

  • Label Studio (prompt-centric + automation / prelabeling): strong patterns for prelabel → human verify → export, plus prompt workflows for reducing copy/paste overhead. (Label Studio)
  • Argilla + spaCy-LLM tutorials: good reference architecture for “LLM suggestions stored alongside records” and iterative improvement of label quality. (Argilla)
  • Prodigy + LLM recipes: practical “correct the LLM” workflows designed for annotation speed (paid tool, but the docs have reusable concepts). (prodi.gy)

What to borrow for cat-vader

  • A first-class review UI loop (even a lightweight one: streamlit/gradio) where users can correct ambiguous labels and export a “gold” set.
  • Treat the LLM as a suggestion engine + uncertainty flagger, not just the final labeler.
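A minimal sketch of that uncertainty-flagging loop, in pure Python. The row dicts and per-model keys here are hypothetical, not cat-vader’s actual output schema; the idea is just “any row without unanimous ensemble agreement goes to the human review queue”:

```python
def flag_for_review(rows, model_keys):
    """Return indices of rows where the ensemble members disagree.

    rows: list of dicts, one hypothetical label key per model
    model_keys: which keys hold per-model labels
    """
    return [
        i for i, row in enumerate(rows)
        if len({row[k] for k in model_keys}) > 1  # >1 distinct label = disagreement
    ]

posts = [
    {"text": "tax policy post", "gpt_label": "politics", "claude_label": "economy"},
    {"text": "cat photo", "gpt_label": "personal", "claude_label": "personal"},
]
queue = flag_for_review(posts, ["gpt_label", "claude_label"])  # -> [0]
```

The flagged indices become the input to whatever lightweight streamlit/gradio review screen you bolt on top; corrected rows accumulate into the “gold” set.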

2) Synthetic data + scalable LLM pipeline orchestration

  • Distilabel (Argilla): opinionated framework for building reliable, scalable pipelines for synthetic data and AI feedback. Useful patterns for batching, caching, evaluation hooks, and reproducibility. (GitHub)
  • LLM_Tool: a “research-friendly” end-to-end pipeline for annotating text datasets, tracking benchmarks, and training classifiers. (GitHub)

What to borrow

  • Pipeline primitives: stages, artifacts, run manifests, cached intermediate outputs, evaluation summaries.
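One of those primitives, cached intermediate outputs, fits in a small decorator. This is an illustrative stdlib-only sketch of the distilabel-style pattern (key a stage’s JSON-serializable result by its inputs, skip re-computation on re-runs), not anything cat-vader currently ships:

```python
import functools
import hashlib
import json
import os

def cached_stage(cache_dir="cache"):
    """Cache a pipeline stage's JSON-serializable output, keyed by the
    stage name and its arguments, so interrupted runs can resume."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            key = hashlib.sha256(
                json.dumps([fn.__name__, args, kwargs],
                           sort_keys=True, default=str).encode()
            ).hexdigest()[:16]
            path = os.path.join(cache_dir, f"{fn.__name__}-{key}.json")
            if os.path.exists(path):          # cache hit: skip the work
                with open(path) as f:
                    return json.load(f)
            result = fn(*args, **kwargs)      # cache miss: compute and persist
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, "w") as f:
                json.dump(result, f)
            return result
        return inner
    return wrap
```

For LLM labeling stages this matters a lot: a crash at post 2,400 of 2,500 shouldn’t cost you 2,400 paid API calls on the retry.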

3) Weak supervision + ensembles (conceptual match to “multi-model vote”)

  • Snorkel (data programming): classic approach to combining multiple noisy labeling sources (heuristics, models) into a better latent label estimate. (PMC)
  • Prompted weak supervision (Alfred): prompts as labeling functions; relevant if you expand “auto category discovery” into “prompt library of heuristics.” (ACL Anthology)

What to borrow

  • Instead of plain majority vote, add an optional “learned combiner” (even a simple Dawid–Skene-style weighting) so stronger models (or historically reliable models) count more on specific categories.
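A toy version of that learned combiner, assuming you have a small hand-coded gold set. This weights each model by gold-set accuracy and then takes a weighted vote; it is a deliberately simplified stand-in for full Dawid–Skene estimation (which would also learn per-category confusion), and all names are illustrative:

```python
from collections import defaultdict

def accuracy_weights(gold, preds_by_model):
    """Weight each model by its accuracy on a small hand-coded gold set."""
    weights = {}
    for model, preds in preds_by_model.items():
        correct = sum(p == g for p, g in zip(preds, gold))
        weights[model] = correct / len(gold)
    return weights

def weighted_vote(labels_by_model, weights):
    """Combine one row's per-model labels; each vote counts in
    proportion to that model's gold-set accuracy."""
    scores = defaultdict(float)
    for model, label in labels_by_model.items():
        scores[label] += weights[model]
    return max(scores, key=scores.get)
```

With plain majority vote, two weak models outvote one strong one; here the historically reliable model wins ties it deserves to win.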

4) Social-media-specific NLP baselines (non-LLM, but critical for validation)

  • TweetNLP: packaged classifiers for social-media tasks (topic, sentiment, emotion, hate/offensive, irony). (PyPI)
  • BERTweet: canonical tweet-domain pretraining baseline. (GitHub)
  • TweetEval benchmark: widely used evaluation suite for tweet classification tasks. (Hugging Face)

What to borrow

  • Use at least one strong non-LLM baseline (or task-specific transformer) as a sanity check for stability and drift: “Do LLM labels behave wildly differently than a tuned domain model on the same data?”
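The sanity check itself can be a one-liner tracked across runs. Sketch below; the baseline labels would come from, say, a TweetNLP topic classifier run over the same posts (that integration is left out here), and a sudden drop in agreement between runs is your drift alarm:

```python
def agreement_rate(llm_labels, baseline_labels):
    """Fraction of posts where the LLM pipeline and a non-LLM domain
    baseline assign the same label. Track this per run: stable labels
    should produce a stable rate."""
    assert len(llm_labels) == len(baseline_labels)
    same = sum(a == b for a, b in zip(llm_labels, baseline_labels))
    return same / len(llm_labels)

# e.g. 0.78 this week vs 0.62 last week -> investigate before trusting trends
```

Exact agreement isn’t the goal (the taxonomies differ); *stability* of agreement is.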

5) Bluesky / AT Protocol ingestion patterns

  • Official Bluesky developer docs (SDK pointers). (Bluesky Documentation)
  • Community-maintained atproto Python SDK (broad coverage of the protocol). (GitHub)

What to borrow

  • Standard handling for pagination/cursors, session reuse, rate limits, and consistent “record schemas” across endpoints.
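The cursor-pagination part generalizes nicely if the page fetcher is injected, which also keeps it testable without the network. A sketch of the loop shape used by endpoints like Bluesky’s `app.bsky.feed.getAuthorFeed` (the `fetch_page` contract here is my assumption, not the atproto SDK’s API):

```python
import time

def fetch_all(fetch_page, max_pages=50, delay=0.0):
    """Drain a cursor-paginated endpoint.

    fetch_page(cursor) must return (records, next_cursor); iteration
    stops when next_cursor is None or max_pages is reached. Retries,
    auth, and rate-limit handling live inside fetch_page.
    """
    records, cursor = [], None
    for _ in range(max_pages):
        batch, cursor = fetch_page(cursor)
        records.extend(batch)
        if cursor is None:
            break
        if delay:
            time.sleep(delay)  # simple politeness delay between pages
    return records
```

Provenance columns (fetched_at, cursor, auth_used) can then be stamped onto each batch inside `fetch_page` before it returns.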

Common pitfalls (that show up in your exact use case)

A) Engagement modeling pitfalls

  • No impressions / reach on Bluesky (views = 0) means “likes” conflate content appeal with distribution mechanics you can’t observe. In your writeup you correctly restrict conclusions to likes/replies. (Chris Soria)
  • Heavy-tailed engagement: your approach uses log(likes + 1) and shows that adding account fixed effects jumps R² from ~18% to ~61.9%. That’s a good illustration of “creator identity dominates.” (Chris Soria)
  • Simpson’s paradox / composition effects: your “economy tanks engagement” disappears once you control for who posts—exactly the kind of thing social media analysis constantly hits. (Chris Soria)

Implication for cat-vader: the library should make “composition controls” (fixed effects, stratified reports) easy and default-ish.
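The economy-posts paradox can be reproduced in a few lines, which is roughly what a built-in “composition check” would do: compare raw per-category means against within-account ones (engagement demeaned by each account’s own average). Field names are hypothetical:

```python
from collections import defaultdict
from statistics import mean

def composition_check(rows):
    """Raw per-category engagement means vs within-account category
    effects. If a category's raw gap vanishes within accounts, the
    'effect' was really about who posts that category."""
    by_handle = defaultdict(list)
    for r in rows:
        by_handle[r["handle"]].append(r["likes"])
    handle_mean = {h: mean(v) for h, v in by_handle.items()}

    raw, within = defaultdict(list), defaultdict(list)
    for r in rows:
        raw[r["category"]].append(r["likes"])
        within[r["category"]].append(r["likes"] - handle_mean[r["handle"]])
    return ({c: mean(v) for c, v in raw.items()},
            {c: mean(v) for c, v in within.items()})
```

On synthetic data where a high-engagement account mostly posts politics and a low-engagement account mostly posts economy, the raw means say “economy tanks engagement” while the within-account means say there is no gap at all.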

B) LLM labeling pitfalls

  • Prompt + schema choices can create systematic label bias (e.g., majority-label bias in-context), especially for multi-label setups where “Other” and borderline categories are frequent. (arXiv)
  • Small category sets can look stable while actually drifting (label semantics shift run-to-run).

C) Product pitfalls (practical adoption)

  • Users abandon tools when:

    • the API is slightly inconsistent across docs/examples,
    • auth/env setup is brittle,
    • outputs aren’t easily auditable.

Your repo already emphasizes best practices like “detailed category descriptions,” and notes that chain-of-thought / step-back prompting didn’t consistently help. (GitHub)


Suggestions for cat-vader (prioritized, concrete)

1) Tighten “first 15 minutes” usability (docs + API consistency)

Why: this is the #1 adoption lever for open source tooling.

  • Unify parameter naming across README + posts + code.
    Your writeup example uses sm_posts=250, but the public API reference is sm_limit. That’s the kind of mismatch that creates immediate friction. (Chris Soria)
  • Update repo docs that still reference CatLLM paths/names.
    ARCHITECTURE.md describes src/catllm/ and modules like summarize, while the cat-vader changelog states summarize was removed and the package renamed. CONTRIBUTING.md is titled “Contributing to CatLLM” and points to cat-llm commands/URLs. (GitHub)
  • Make supported platforms list consistent.
    README lists "threads", "bluesky", "reddit", "mastodon", "youtube", while the changelog includes LinkedIn support. Choose one source of truth and generate the others from it. (GitHub)

Deliverable idea: a single docs/reference.md generated from docstrings (or vice versa) so the examples can’t drift.
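The docstrings-to-docs direction is a short script; an illustrative sketch (function names are mine, not cat-vader’s):

```python
import inspect

def api_reference_md(objects):
    """Render a minimal docs/reference.md body from live signatures and
    docstrings, so documented parameter names can't drift from the code."""
    lines = []
    for obj in objects:
        sig = inspect.signature(obj)
        lines.append(f"## `{obj.__name__}{sig}`")
        lines.append(inspect.getdoc(obj) or "(no docstring)")
        lines.append("")
    return "\n".join(lines)
```

Run it in CI and diff against the committed reference file; a mismatch like `sm_posts` vs `sm_limit` then fails the build instead of confusing a new user.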


2) Fix credential handling and remove machine-specific paths

In _social_media.py there is a hardcoded _ENV_PATH pointing to a local “Documents/Important_Docs/…” location, and error messages instruct setting env vars there. That will break for essentially everyone except you. (GitHub)

What to do

  • Load environment variables from:

    1. os.environ
    2. optional .env in current working directory (standard python-dotenv behavior)
    3. explicit env_path= parameter if you want to support custom locations
  • Ensure all exceptions reference “set env var X” without referencing a personal filesystem path.

This is a high-impact, low-effort PR.
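A dependency-free sketch of that resolution order (in practice python-dotenv’s `load_dotenv` is the standard choice; `resolve_credential` is a hypothetical helper name, not cat-vader’s API):

```python
import os

def resolve_credential(name, env_path=None):
    """Resolve a credential portably: os.environ first, then an optional
    .env file (cwd by default, or a user-supplied env_path). The error
    message never references anyone's personal filesystem."""
    if name in os.environ:
        return os.environ[name]
    path = env_path or os.path.join(os.getcwd(), ".env")
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith(f"{name}="):
                    return line.split("=", 1)[1]
    raise KeyError(f"Set the environment variable {name} (or pass env_path=).")
```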


3) Make ingestion robust (pagination, rate limits, retries, provenance)

For Bluesky you already handle cursor pagination and optionally authenticate; good. (GitHub)
Next improvements that matter in real-world runs:

  • Add standardized retry/backoff (429/5xx) across all platforms with jitter.
  • Persist provenance columns (endpoint used, fetched_at, cursor/page, auth_used yes/no).
  • Normalize schema across platforms (e.g., unify “quote/repost/share” semantics per platform and document what is missing or always-zero—Bluesky views/shares are always 0). (GitHub)
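The retry/backoff piece is small enough to show. A sketch using “full jitter” exponential backoff; `call()` returning `(status, payload)` is an assumed convention for the example, not any platform client’s real interface:

```python
import random
import time

def with_retries(call, max_tries=5, base=0.5, retry_on=(429, 500, 502, 503)):
    """Retry an HTTP-style call with exponential backoff plus jitter.

    call() returns (status, payload); statuses in retry_on are retried,
    anything else is returned immediately.
    """
    for attempt in range(max_tries):
        status, payload = call()
        if status not in retry_on:
            return status, payload
        # full jitter: sleep a uniform random time in [0, base * 2**attempt)
        time.sleep(random.uniform(0, base * 2 ** attempt))
    return status, payload  # give up, surface the last response
```

Jitter matters once multiple users hit the same platform: synchronized retries are exactly how polite clients accidentally become a thundering herd.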

4) Treat labeling as an experiment: add “auditability” by default

Right now cat-vader supports multi-model ensemble voting and returns per-model outputs + consensus columns. That’s a strong base. (GitHub)

Add these defaults so users can trust results:

  • Store the prompt + schema + model config used (model name, provider, temperature/creativity, thinking budget) in a run manifest (JSON) saved next to outputs.

  • Add a “disagreement report”:

    • rows where models disagree,
    • categories with low agreement,
    • “most confusing pairs” (A vs B).
  • Add a small “gold set” evaluator:

    • user supplies 100 hand-coded posts → cat-vader outputs precision/recall per label + calibration plots (even basic).
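The disagreement report, in particular, falls out of the per-model columns cat-vader already returns. A pure-Python sketch with hypothetical row/key names:

```python
from collections import Counter

def disagreement_report(rows, model_keys):
    """Summarize ensemble disagreement: which rows disagree, and which
    label pairs are most often confused with each other.

    rows: list of dicts with one hypothetical label key per model.
    """
    disagreeing, pair_counts = [], Counter()
    for i, row in enumerate(rows):
        labels = sorted({row[k] for k in model_keys})
        if len(labels) > 1:
            disagreeing.append((i, labels))
            # count every co-occurring label pair on this row
            for a in range(len(labels)):
                for b in range(a + 1, len(labels)):
                    pair_counts[(labels[a], labels[b])] += 1
    return {"rows": disagreeing,
            "confusing_pairs": pair_counts.most_common()}
```

The “most confusing pairs” output doubles as a prompt-debugging tool: if (Politics, News) dominates, the category descriptions for those two need sharper boundaries.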

This matches what labeling platforms and weak supervision systems emphasize: the workflow is label → check → refine, not “label once.” (Label Studio)


5) Improve category discovery so it produces reusable taxonomies

Your approach (descriptions per category + “auto” category discovery) is good. (GitHub)
Two upgrades make it more “research-grade”:

  • Stability analysis baked-in: run discovery multiple times and compute overlap (Jaccard) + naming alignment.

  • Hierarchy support: let users define:

    • a coarse taxonomy (Politics / News / Culture / Personal)
    • optional sublabels (Politics → Elections / Policy / Scandals)

This mirrors shared tasks that treat social labels as multi-label and often hierarchical. (ceur-ws.org)
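The stability analysis reduces to pairwise Jaccard over the discovered category sets (naming alignment, e.g. matching “Politics” to “Political news”, would need an extra fuzzy-matching step not shown here):

```python
def discovery_stability(runs):
    """Mean pairwise Jaccard overlap between the category sets produced
    by repeated discovery runs; near 1.0 means the taxonomy is stable."""
    sets = [set(r) for r in runs]
    scores = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            inter = len(sets[i] & sets[j])
            union = len(sets[i] | sets[j])
            scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0
```

Reporting this number alongside the discovered taxonomy tells users whether “auto” discovery converged or whether each run is inventing a different carving of the data.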


6) Ship opinionated “analysis helpers” for the exact effect you found

Your headline result—identity dominates engagement variance (R² ~61.9% with fixed effects)—is compelling because it’s the right model for this setting. (Chris Soria)

Codify that into the package:

  • engagement_report(df, y="likes", log1p=True, fixed_effect="handle")

  • default outputs:

    • content-only model vs fixed-effects model comparison
    • within-account category uplift table
    • “composition check”: raw category averages vs within-account estimates (the economy paradox case) (Chris Soria)

Also: cat-vader already started adding useful covariates like day, month, hour, n_posts_that_day, post_length, contains_url, contains_image. Make those first-class in analysis helpers. (GitHub)
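To show the shape of the comparison `engagement_report` would produce, here is a dependency-free approximation: the content-only model predicts each category’s mean, the account-aware model predicts each handle’s mean plus the category’s within-account offset. In production you would fit proper fixed effects, e.g. a statsmodels formula like `log_likes ~ C(category) + C(handle)`; this is only a sketch with illustrative names:

```python
from collections import defaultdict
from statistics import mean

def r_squared(y, yhat):
    ybar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def engagement_report(rows, y_key="likes"):
    """Compare a content-only model against an account-aware one and
    report both R^2 values, mirroring the ~18% vs ~62% contrast."""
    y = [r[y_key] for r in rows]
    cat_vals, handle_vals = defaultdict(list), defaultdict(list)
    for r in rows:
        cat_vals[r["category"]].append(r[y_key])
        handle_vals[r["handle"]].append(r[y_key])
    cat_mean = {c: mean(v) for c, v in cat_vals.items()}
    handle_mean = {h: mean(v) for h, v in handle_vals.items()}
    # per-category offsets after removing each account's own level
    offset_vals = defaultdict(list)
    for r in rows:
        offset_vals[r["category"]].append(r[y_key] - handle_mean[r["handle"]])
    offset = {c: mean(v) for c, v in offset_vals.items()}
    content_pred = [cat_mean[r["category"]] for r in rows]
    fe_pred = [handle_mean[r["handle"]] + offset[r["category"]] for r in rows]
    return {"content_only_r2": r_squared(y, content_pred),
            "fixed_effects_r2": r_squared(y, fe_pred)}
```

On data where account identity dominates, the two R² values diverge sharply, which is exactly the summary a user should see by default.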


A practical “next PRs” roadmap

Highest leverage (adoption + correctness)

  1. Remove hardcoded _ENV_PATH; standardize dotenv/env handling. (GitHub)
  2. Fix doc drift (sm_posts vs sm_limit, CatLLM references, supported platforms list). (Chris Soria)

Next (trust + research usefulness)

  1. Add run manifests + disagreement reports + “gold set” evaluator. (GitHub)
  2. Add engagement helpers that default to fixed-effects comparisons (codify your main analytic insight). (Chris Soria)

Later (scale + extensibility)

  1. Optional learned ensemble combiner (Snorkel-style weighting). (arXiv)
  2. Small review UI loop (Label Studio/Argilla-inspired patterns). (Label Studio)

Reading list (directly relevant to your tool)

  • Snorkel / weak supervision (core ensemble theory): Ratner et al. (Snorkel). (PMC)
  • Prompted weak supervision: Alfred (prompted labeling functions). (ACL Anthology)
  • LLM-in-the-loop labeling workflows: Label Studio prompt-centric workflow; Argilla + spaCy-LLM tutorial. (Label Studio)
  • Social media NLP baselines: TweetNLP + BERTweet + TweetEval references. (PyPI)
  • Bluesky ingestion: Bluesky docs + atproto Python SDK. (Bluesky Documentation)

A “north star” framing for cat-vader

If cat-vader becomes the tool where a researcher can:

  1. fetch posts from a platform,
  2. label them with auditable multi-label outputs,
  3. review disagreements quickly,
  4. produce fixed-effects-aware engagement results by default,

…then it’s not just “LLM classification,” it’s a reproducible social media coding workbench—and your Bluesky result (identity dominates; content still matters within identity) becomes a built-in, repeatable analysis template. (Chris Soria)