We migrated conselara.dev from a self-hosted nginx container on a local server to S3 + CloudFront. The motivation was simple: a static site has no business running on a server we have to patch. The migration took a few hours and involved four gotchas that aren’t obvious from the AWS documentation. This is a record of what we did and what tripped us up.

The setup

- Hugo static site (PaperMod theme)
- S3 bucket with all public access blocked — Origin Access Control (OAC) only
- CloudFront distribution with ACM SSL cert
- Cloudflare DNS, gray cloud (DNS-only)
- Gitea self-hosted repo with a webhook-triggered deploy container on-prem

The deploy flow on push: Gitea fires a webhook → container on saturn pulls the repo, runs hugo --minify, syncs to S3, invalidates CloudFront. ...
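For reference, a minimal sketch of that deploy step in Python, with placeholder values for the bucket, distribution ID, and repo path (the real container may simply shell out to the AWS CLI end to end):

```python
# deploy.py: hedged sketch of the webhook-triggered deploy step.
# REPO_DIR, BUCKET, and DISTRIBUTION_ID are placeholders, not our actual values.
import subprocess
import time

import boto3

REPO_DIR = "/srv/site"
BUCKET = "example-bucket"
DISTRIBUTION_ID = "EXXXXXXXXXXXXX"


def deploy() -> None:
    # Pull the latest commit and rebuild the site.
    subprocess.run(["git", "pull"], cwd=REPO_DIR, check=True)
    subprocess.run(["hugo", "--minify"], cwd=REPO_DIR, check=True)

    # Sync the generated public/ directory to S3, removing stale objects.
    subprocess.run(
        ["aws", "s3", "sync", f"{REPO_DIR}/public", f"s3://{BUCKET}", "--delete"],
        check=True,
    )

    # Invalidate everything so CloudFront serves the new build immediately.
    cloudfront = boto3.client("cloudfront")
    cloudfront.create_invalidation(
        DistributionId=DISTRIBUTION_ID,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/*"]},
            "CallerReference": str(time.time()),
        },
    )


if __name__ == "__main__":
    deploy()
```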
vLLM on DGX Spark: What the SM121 Architecture Actually Requires
The DGX Spark GB10 runs SM121 — the Grace Blackwell GB10 Superchip. It is not the same silicon as datacenter Blackwell (SM100, the B200/GB200 class), and it is not Hopper (H100/H200) either. SM121 lacks TMEM, WGMMA, DSMEM, and NVSwitch. Several vLLM defaults, forum recommendations, and NVIDIA docs written for datacenter Blackwell do not apply, and some actively break things on SM121. This is a reference for what we learned running vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) on two DGX Sparks — single-node and two-node cluster configurations. ...
We Replaced an MCP Server with FastAPI and It Worked Everywhere
We built an internal knowledge base server to give our AI agents access to Conselara’s company data — capabilities, past performance, GSA rates, certifications. The idea was straightforward: expose it as an MCP server so any AI client could query it semantically. It worked in Claude Code. It worked nowhere else.

What MCP promises

The Model Context Protocol is Anthropic’s open standard for connecting AI models to external tools and data sources. The pitch is compelling: define your server once, and any MCP-compatible client can call it. Claude Code has native MCP support. The ecosystem is growing. ...
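For a sense of the shape the plain-HTTP replacement takes, here is a minimal FastAPI sketch; the path, request model, and stubbed search are hypothetical placeholders, not our actual API:

```python
# Hedged sketch of a plain-HTTP knowledge base query endpoint.
# The /search path, QueryRequest model, and stub results are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    query: str
    top_k: int = 5


@app.post("/search")
def search_knowledge_base(req: QueryRequest) -> dict:
    # In the real server this would run a semantic search over company data;
    # here we return a stub so the sketch stays self-contained and runnable.
    results = [{"title": "placeholder", "score": 1.0}][: req.top_k]
    return {"query": req.query, "results": results}
```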
AI Across a Health Research Information Platform
We are integrating AI across several workstreams on a federal health research information platform.

- Publication discovery — using LLMs to surface relevant PubMed research, reducing manual literature review time and improving coverage across a high-volume publication landscape.
- LLM comparative evaluations — running structured benchmarks across models to assess quality, consistency, and cost for specific content tasks on the platform. Evaluations are task-specific rather than general — we score against real outputs the platform needs to produce. ...
DGX Spark Benchmark Results: vLLM on SM121
Measured throughput and latency on DGX Spark GB10 (SM121) hardware. All results use vLLM 0.19.0 (NGC container nvcr.io/nvidia/vllm:26.04-py3) unless noted.

Qwen3-235B-A22B-GPTQ-Int4 — Two-node cluster

Date: 2026-05-03
Config: TP=2, EP=2, Ray cluster over QSFP-DD RoCE direct interconnect, --attention-backend=TRITON_ATTN, --quantization=gptq_marlin, --kv-cache-dtype=fp8, --gpu-memory-utilization=0.87

| Batch | Avg completion tokens | tok/s per request | Aggregate tok/s |
|---|---|---|---|
| 1 (serial) | 256 | 17.0 | 17.0 |
| 2 (concurrent) | 256 | 12.1 | 24.1 |
| 4 (concurrent) | 256 | 9.1 | 36.4 |

Prefix cache: 97% delta hit rate on repeated system prompt.
Startup to first inference: ~15 minutes (Ray init + weight load across two nodes + compile).
Weight resident per node: 57.64 GiB. ...
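The harness itself isn't shown in this excerpt; as a rough sketch, per-request and aggregate tok/s can be measured against the OpenAI-compatible endpoint like this (URL, model id, and prompt are placeholders, not our benchmark code):

```python
# Hedged sketch: measuring per-request vs aggregate tok/s against a running
# vLLM OpenAI-compatible server. Endpoint, model id, and prompt are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "Qwen3-235B-A22B-GPTQ-Int4"  # placeholder model id


def one_request() -> tuple[int, float]:
    # Issue a single completion and return (completion tokens, wall time).
    start = time.time()
    resp = client.completions.create(
        model=MODEL, prompt="Explain RoCE in one paragraph.", max_tokens=256
    )
    return resp.usage.completion_tokens, time.time() - start


def run_batch(concurrency: int) -> None:
    # Fire `concurrency` requests at once; aggregate tok/s uses total wall time.
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(concurrency)))
    wall = time.time() - start
    total_tokens = sum(tokens for tokens, _ in results)
    per_request = [tokens / elapsed for tokens, elapsed in results]
    print(
        f"batch={concurrency}  "
        f"per-request tok/s={sum(per_request) / len(per_request):.1f}  "
        f"aggregate tok/s={total_tokens / wall:.1f}"
    )


for batch in (1, 2, 4):
    run_batch(batch)
```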
DGX Spark Model Comparison: What Fits and What Runs (SM121, 128 GB)
Quick-reference comparison of open-weight models for a single DGX Spark GB10 (SM121, 128 GB unified LPDDR5X memory). Based on tested configurations and community results as of May 2026.

| Model | Architecture | Quantization (memory) | Fits | Expected tok/s | SM121 notes |
|---|---|---|---|---|---|
| Qwen3.6-35B-A3B | Pure MoE (3B active) | FP8 (~35 GB) | ✅ easily | 100+ | Pure MoE, no GDN — fully supported |
| Qwen3.6-27B | Dense hybrid (GDN) | FP8 (~28 GB) | ✅ easily | 14–21 (stock) / 136–200 (fork) | GDN kernel gap; experimental fork needed for full speed |
| Qwen3-30B-A3B | Pure MoE (3.3B active) | NVFP4 / FP8 / BF16 (~16–60 GB) | ✅ easily | 32–50 | Solid single-node option; no GDN |
| gpt-oss-120b | Sparse MoE (5.1B active) | mxfp4 (~61 GB) | ✅ | 32–60 | 128K context; proprietary quant format |
| Qwen3.5-122B-A10B | Pure MoE (10B active) | NVFP4 only (~75 GB) | ✅ | up to 51 | BF16 is 234 GB — does not fit; NVFP4 is the only path |
| Qwen3-235B-A22B | Pure MoE (22B active) | GPTQ-Int4 (~60 GB/node) | ✅ (two nodes) | 17–36 agg | Requires two DGX Sparks; best quality available |
| Qwen3.5-397B-A17B | Pure MoE (17B active) | NVFP4 (TP=2) | ✅ (two nodes) | Unknown | SM121 MoE kernel not yet optimized; not recommended |

Key observations

Throughput vs quality tradeoff at single-node: Qwen3.6-35B-A3B gives the highest throughput (100+ tok/s) with pure MoE architecture. Qwen3.5-122B-A10B gives the most capable model (10B active parameters) that fits on one node, at 51 tok/s. For most agentic workloads the bottleneck is tool latency, not token generation — so 51 tok/s is more than sufficient. ...
Piloting AWS DevOps Guru and Amazon Q for AIOps
We are running a pilot of AWS DevOps Guru paired with Amazon Q across a federal AWS estate.

DevOps Guru provides ML-driven anomaly detection and automated root cause analysis. Rather than relying on manually defined alert thresholds, it builds a baseline from operational data and flags deviations — reducing noise and surfacing issues that threshold-based alerting misses.

Amazon Q brings generative AI into engineer troubleshooting workflows. When an anomaly is flagged, engineers can query Amazon Q directly for accelerated diagnosis — pulling in relevant runbooks, log context, and suggested remediation paths without switching tools. ...
vLLM Model Selection for DGX Spark (SM121)
The DGX Spark GB10 SoC (SM121) has specific constraints that determine which models run well and which don’t. This is a practical guide based on what we’ve tested in production.

The key constraint: SM121 kernel compatibility

Not all model architectures run well on SM121 with the NGC vLLM container. The main constraint is the MoE kernel:

- Marlin kernel — stable, fast, supports GPTQ-Int4 and mxfp4
- CUTLASS FP4 — broken on SM121, produces garbage outputs silently; never use
- GDN (GatedDeltaNet) — kernel gap on SM121, 14–21 tok/s with stock NGC; requires experimental fork for full speed

Prefer pure MoE models over dense hybrid architectures when using the NGC container. Pure MoE (no GDN/Mamba layers) runs fully through Marlin and is well-tested on SM121. ...
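As a hedged illustration of pinning the stable Marlin path from vLLM's Python API (the checkpoint id here is a placeholder; in practice we pass the equivalent flags to the server CLI):

```python
# Hedged sketch: explicitly selecting the Marlin GPTQ kernel path in vLLM.
# The model id and memory setting are placeholders, not a tested configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B-GPTQ-Int4",  # placeholder GPTQ-Int4 checkpoint
    quantization="gptq_marlin",            # force the stable Marlin path
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.87,
)

out = llm.generate(
    ["Summarize the SM121 kernel constraints."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```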
SearXNG: Engine Selection for Reliable Results
If you’re running SearXNG as a self-hosted search backend for an automated pipeline, the default engine selection will cause you problems quickly. Here’s what we’ve found running SearXNG 24/7 for a federal procurement intelligence pipeline.

What doesn’t work

- Google — returns 403 Forbidden for bot-detected requests. Happens immediately on most self-hosted instances without aggressive Cloudflare bypass configuration. Don’t rely on it for automated queries.
- Startpage — CAPTCHAs after a few queries. Fine for occasional manual searches, unusable for scheduled pipelines. ...
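As a sketch of how a pipeline can pin the engine list per request rather than relying on the defaults, assuming JSON output is enabled in settings.yml (the URL and engine names are examples, not a recommendation from the post):

```python
# Hedged sketch: querying a self-hosted SearXNG instance with an explicit engine
# list so unreliable defaults never get hit. Assumes format=json is enabled in
# settings.yml; the instance URL and engine choices are example values.
import requests

SEARXNG_URL = "http://localhost:8080/search"


def search(query: str, engines: str = "duckduckgo,brave,wikipedia") -> list[dict]:
    resp = requests.get(
        SEARXNG_URL,
        params={"q": query, "format": "json", "engines": engines},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


for hit in search("GSA schedule modifications")[:5]:
    print(hit.get("title"), "|", hit.get("url"))
```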
Running Qwen3.5-122B on a Single DGX Spark
The NVIDIA DGX Spark (GB10 SoC, 128GB unified LPDDR5X memory) can run Qwen3.5-122B-A10B — a 122B parameter MoE model — at usable throughput for production workloads. Here’s what it actually takes.

The key constraint: NVFP4 only

Qwen3.5-122B-A10B at full precision is ~250GB. In NVFP4 quantization it’s ~75GB, which fits comfortably in 128GB unified memory. There is no other quantization path that both fits and runs correctly on the GB10. The only verified checkpoint we’ve found: bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4 on HuggingFace, which includes 15 patches for the SM121 architecture. Use this; don’t try to quantize the base model yourself unless you’re prepared to debug SM121-specific kernel failures. ...
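A minimal sketch of loading that checkpoint through vLLM's Python API, with example values for memory headroom and context length (the exact flags we run in production come later in the post):

```python
# Hedged sketch: loading the NVFP4 checkpoint on a single GB10 node.
# gpu_memory_utilization and max_model_len are example values, not tuned settings;
# the quantization format is expected to be picked up from the checkpoint config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="bjk110/SPARK_Qwen3.5-122B-A10B-NVFP4",
    gpu_memory_utilization=0.85,  # example: leave headroom in unified memory
    max_model_len=32768,          # example: cap context to bound KV cache size
)

out = llm.generate(["Hello from the Spark."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```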