diff --git a/README.md b/README.md
index 0126a70..cfbeb3d 100644
--- a/README.md
+++ b/README.md
@@ -24,16 +24,32 @@ src/
 ## Configuração rápida
 1) Crie o ambiente e instale dependências:
- - `pip install -r requirements.txt`
-2) Configure variáveis de ambiente (exemplo em `sample.env`).
+ - `pip install -r requirements.txt`
+2) Configure variáveis de ambiente (exemplo em `sample.env` ou `.env`).
 3) Execute o servidor:
- - `PYTHONPATH=src python -m main`
+ - `./scripts/start.sh`
 Acesse:
 - `http://localhost:8204/docs` (Swagger UI)
 - `http://localhost:8204` (informações básicas da API)
+## UI local (Agent UI)
+
+Use o **Agent UI** (agno-agi/agent-ui) como front-end local:
+
+1) Instale com o script oficial:
+
+- `npx create-agent-ui@latest`
+
+2) Inicie a UI:
+
+- `pnpm dev`
+
+3) Abra `http://localhost:3000` e ajuste o endpoint para `http://localhost:8204`.
+
+Opcional: se o AgentOS usar autenticação, configure `OS_SECURITY_KEY` conforme o README do Agent UI.
+
 ## Fluxo do time
 1) **Gestor** recebe a requisição e valida o contexto (banco + SQL).
@@ -43,7 +59,23 @@ Acesse:
 5) **Conservative Analyst** (se solicitado) gera análise sem reescrever a query.
 6) **Gestor** consolida e entrega.
+## RAG (KB interna)
+
+- Coloque documentos em `kb/` (md/txt/sql/pdf).
+- O RAG local usa Chroma + SentenceTransformers.
+- Variáveis principais:
+  - `SQL_OPT_KB_PATH`, `SQL_OPT_KB_CHROMA_PATH`, `SQL_OPT_KB_DB_FILE`
+  - `SQL_OPT_KB_EMBEDDER_ID`
+  - `SQL_OPT_BLOCK_EXTERNAL_TOOLS=true` bloqueia ferramentas externas.
+
+## Observabilidade de tokens/custos
+
+- Ative com `LLM_LOG_USAGE=true`.
+- Defina preços (USD por 1K tokens) com:
+  - `LLM_COST_INPUT_PER_1K`
+  - `LLM_COST_OUTPUT_PER_1K`
+
 ## Observações
-- Use o modelo configurado em variáveis de ambiente (ex.: OpenAI, Gemini, Groq, etc.).
+- Use o provedor configurado em `.env` (ex.: Ollama local, OpenAI, Gemini, Groq, etc.).
 - O time é colaborativo e mantém histórico em SQLite (configurável via env).
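The per-1K pricing described in the "Observabilidade de tokens/custos" section combines with token counts in the obvious way; a minimal sketch, assuming only the two README variables (the helper name and the sample prices are illustrative, not project code):

```python
import os

# Hypothetical prices in USD per 1K tokens, matching LLM_COST_INPUT_PER_1K /
# LLM_COST_OUTPUT_PER_1K as documented in the README section above.
os.environ["LLM_COST_INPUT_PER_1K"] = "0.15"
os.environ["LLM_COST_OUTPUT_PER_1K"] = "0.60"

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD computed from per-1K-token prices read from the environment."""
    cost_in = float(os.getenv("LLM_COST_INPUT_PER_1K", "0") or 0)
    cost_out = float(os.getenv("LLM_COST_OUTPUT_PER_1K", "0") or 0)
    return input_tokens * cost_in / 1000 + output_tokens * cost_out / 1000

# 2000 * 0.15 / 1000 + 500 * 0.60 / 1000 = 0.30 + 0.30
print(round(estimate_cost_usd(2000, 500), 4))  # → 0.6
```

With both prices left at their default `0`, the computed cost is `0.0`, which keeps the logging harmless when pricing is not configured.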
diff --git a/docs/proposta-arquitetura-agno.md b/docs/proposta-arquitetura-agno.md index 5c195a0..a98270b 100644 --- a/docs/proposta-arquitetura-agno.md +++ b/docs/proposta-arquitetura-agno.md @@ -124,6 +124,11 @@ Recomendação de ferramentas de mercado: - **Langfuse** ou **Phoenix** para rastreio de prompts, custos e latência. - **Grafana/Prometheus** para dashboards executivos. +Status no POC: + +- **Logging de tokens/custos** já implementado via `LLM_LOG_USAGE` e custos por 1K tokens. +- Métricas persistentes e dashboards (Grafana/Prometheus) permanecem como evolução. + Métricas mínimas: - Tokens por request e por área. @@ -150,6 +155,11 @@ Métricas mínimas: - Curadoria contínua com feedback dos times para melhorar a relevância. - **Aumento de precisão**: respostas consistentes com políticas internas e padrões técnicos. +Status no POC: + +- **RAG local** com base interna em `kb/` usando Chroma + SentenceTransformers. +- **Bloqueio de ferramentas externas** por padrão via `SQL_OPT_BLOCK_EXTERNAL_TOOLS=true`. + ## 10) Stack definitiva (100% Agno) - **Agno** como framework único para orquestração, memória e tools. diff --git a/kb/README.md b/kb/README.md new file mode 100644 index 0000000..932d787 --- /dev/null +++ b/kb/README.md @@ -0,0 +1,18 @@ +# Base de Conhecimento Interna (KB) + +Coloque aqui documentos internos que devem ser usados no RAG. 
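The `SQL_OPT_*` settings referenced above are plain environment variables read with defaults; a stdlib-only sketch of that read-with-default pattern (defaults copied from this diff; `env_flag` and `kb_settings` are illustrative helpers, not project code):

```python
import os
from pathlib import Path

def env_flag(name: str, default: str = "true") -> bool:
    """Truthy-string convention used throughout this diff: 1/true/yes/on."""
    return os.getenv(name, default).strip().lower() in {"1", "true", "yes", "on"}

def kb_settings() -> dict:
    """Defaults copied from the diff; the dict shape itself is illustrative."""
    return {
        "kb_path": Path(os.getenv("SQL_OPT_KB_PATH", "kb")),
        "chroma_path": Path(os.getenv("SQL_OPT_KB_CHROMA_PATH", "tmp/kb_chroma")),
        "embedder_id": os.getenv(
            "SQL_OPT_KB_EMBEDDER_ID", "sentence-transformers/all-MiniLM-L6-v2"
        ),
        "block_external": env_flag("SQL_OPT_BLOCK_EXTERNAL_TOOLS"),
    }
```

Because `env_flag` defaults to the string `"true"`, an unset `SQL_OPT_BLOCK_EXTERNAL_TOOLS` evaluates to `True`, which is why external tools are blocked by default.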
+ +Suportado (por padrão): + +- Markdown (.md) +- Texto (.txt) +- SQL (.sql) +- PDF (.pdf) + +Configurações via ambiente: + +- SQL_OPT_KB_PATH (padrão: kb) +- SQL_OPT_KB_CHROMA_PATH (padrão: tmp/kb_chroma) +- SQL_OPT_KB_EMBEDDER_ID (padrão: sentence-transformers/all-MiniLM-L6-v2) +- SQL_OPT_KB_DB_FILE (padrão: tmp/sql_optimizer_kb.db) +- SQL_OPT_BLOCK_EXTERNAL_TOOLS (padrão: true) diff --git a/requirements.txt b/requirements.txt index 37b04de..eee1504 100644 --- a/requirements.txt +++ b/requirements.txt @@ -30,3 +30,7 @@ oracledb==3.4.1 pymssql==2.3.11 sqlparse==0.5.5 sqlglot==28.6.0 + +# RAG (local KB) +chromadb==0.6.3 +sentence-transformers==3.4.1 diff --git a/sample.env b/sample.env index a54f0bc..d2cd6e5 100644 --- a/sample.env +++ b/sample.env @@ -15,3 +15,15 @@ # SQL Optimizer Team SQL_OPT_TEAM_DB_FILE=tmp/sql_optimizer_team.db SQL_OPT_TEAM_DEBUG_MODE=false + +# Observabilidade de tokens/custos +LLM_LOG_USAGE=true +LLM_COST_INPUT_PER_1K=0 +LLM_COST_OUTPUT_PER_1K=0 + +# RAG / KB interna +SQL_OPT_KB_PATH=kb +SQL_OPT_KB_CHROMA_PATH=tmp/kb_chroma +SQL_OPT_KB_DB_FILE=tmp/sql_optimizer_kb.db +SQL_OPT_KB_EMBEDDER_ID=sentence-transformers/all-MiniLM-L6-v2 +SQL_OPT_BLOCK_EXTERNAL_TOOLS=true diff --git a/src/sql_optimizer_team/agents/conservative_analysis_agent.py b/src/sql_optimizer_team/agents/conservative_analysis_agent.py index 9772d42..882947b 100644 --- a/src/sql_optimizer_team/agents/conservative_analysis_agent.py +++ b/src/sql_optimizer_team/agents/conservative_analysis_agent.py @@ -7,6 +7,53 @@ import os base_model = get_model() +CONSERVATIVE_ANALYSIS_PROMPT = """ +You are an expert $database_name database analyst and performance specialist. + +Your task is to ANALYZE the SQL query below and provide SUGGESTIONS for improvement. + +⚠️ CRITICAL: You must NOT rewrite or modify the query. Only provide analysis and suggestions. 
+ +$database_name SQL Query: +```sql +$query +``` + +Query Complexity Information: +- Columns: $column_count +- Tables: $table_count +- Subqueries: $subquery_count +- CASE statements: $case_count +- JOINs: $join_count +- Complexity Level: $complexity_level + +Provide your analysis in the following structured format: + +## PERFORMANCE ISSUES +List each performance issue found, with severity (CRITICAL/HIGH/MEDIUM/LOW): +- [SEVERITY] Issue description +- [SEVERITY] Issue description + +## SUGGESTED INDEXES +List indexes that could improve this query: +- CREATE INDEX idx_name ON table(columns) -- Reason + +## OPTIMIZATION SUGGESTIONS +List specific suggestions WITHOUT rewriting the query: +- Suggestion 1: Description of what could be improved and why +- Suggestion 2: Description of what could be improved and why + +## RISK ASSESSMENT +- WITH (NOLOCK) usage: [Yes/No] - If yes, explain the risks +- Missing WHERE clause: [Yes/No] - If yes, explain the impact +- Implicit conversions: [Yes/No] - If yes, list them + +## SUMMARY +Brief summary of the most important findings and priority order for addressing them. + +Remember: DO NOT provide a rewritten query. Only analysis and suggestions. +""".strip() + _db_path = os.getenv("SQL_OPT_TEAM_DB_FILE", "tmp/sql_optimizer_team.db") _debug_mode = os.getenv("SQL_OPT_TEAM_DEBUG_MODE", "false").strip().lower() in {"1", "true", "yes", "on"} @@ -30,52 +77,7 @@ conservative_analysis_agent = Agent( "- Solicite banco e SQL se não estiverem presentes.", "- Se o usuário fornecer um caminho de arquivo, use load_sql_from_file().", "- Use a template oficial abaixo para a análise conservadora (sem reescrever a SQL).", - """ - You are an expert $database_name database analyst and performance specialist. - - Your task is to ANALYZE the SQL query below and provide SUGGESTIONS for improvement. - - ⚠️ CRITICAL: You must NOT rewrite or modify the query. Only provide analysis and suggestions. 
- - $database_name SQL Query: - ```sql - $query - ``` - - Query Complexity Information: - - Columns: $column_count - - Tables: $table_count - - Subqueries: $subquery_count - - CASE statements: $case_count - - JOINs: $join_count - - Complexity Level: $complexity_level - - Provide your analysis in the following structured format: - - ## PERFORMANCE ISSUES - List each performance issue found, with severity (CRITICAL/HIGH/MEDIUM/LOW): - - [SEVERITY] Issue description - - [SEVERITY] Issue description - - ## SUGGESTED INDEXES - List indexes that could improve this query: - - CREATE INDEX idx_name ON table(columns) -- Reason - - ## OPTIMIZATION SUGGESTIONS - List specific suggestions WITHOUT rewriting the query: - - Suggestion 1: Description of what could be improved and why - - Suggestion 2: Description of what could be improved and why - - ## RISK ASSESSMENT - - WITH (NOLOCK) usage: [Yes/No] - If yes, explain the risks - - Missing WHERE clause: [Yes/No] - If yes, explain the impact - - Implicit conversions: [Yes/No] - If yes, list them - - ## SUMMARY - Brief summary of the most important findings and priority order for addressing them. - - Remember: DO NOT provide a rewritten query. Only analysis and suggestions. 
- """.strip(), + CONSERVATIVE_ANALYSIS_PROMPT, "- NÃO reescreva a SQL em hipótese alguma.", ], ) diff --git a/src/sql_optimizer_team/agents/sql_analyst_agent.py b/src/sql_optimizer_team/agents/sql_analyst_agent.py index 961c3e3..ec9edee 100644 --- a/src/sql_optimizer_team/agents/sql_analyst_agent.py +++ b/src/sql_optimizer_team/agents/sql_analyst_agent.py @@ -1,18 +1,101 @@ from agno.agent import Agent from agno.db.sqlite import SqliteDb from sql_optimizer_team.tools.engine.model_selector import get_model -from sql_optimizer_team.tools.core_tools import explain_query_core from sql_optimizer_team.tools.prompt_tools import supported_databases from sql_optimizer_team.tools.sql_tools import load_sql_from_file, ensure_non_empty import os base_model = get_model() +SQL_TO_NATURAL_PROMPT = """ +You are an expert $database_name database analyst and performance specialist. Your task is to translate the SQL query below into a detailed, precise natural-language description that another agent will later use to reconstruct and optimize the query. + +$database_name SQL Query: +```sql +$query +``` + +Your explanation must follow these requirements: + +1. **Describe the overall purpose** +- Explain clearly what the query is intended to accomplish and why (retrieve data, update rows, aggregate information, validate existence, create structures, etc.). + +2. **List ALL involved database objects** +Explicitly list every: +- Table +- View +- CTE (Common Table Expression) +- Subquery or derived table +- Function +- Stored procedure, if referenced +- Temporary table +- Schema-qualified object +Use the exact names as they appear in the query. + +3. **Describe all essential operations** +Explicitly state, using exact column names: +- Columns retrieved or modified +- Join types, join conditions, and which objects participate +- Filters and conditions (WHERE, boolean logic, comparisons) +- Aggregations (SUM, COUNT, etc.) 
+- Grouping and HAVING clauses +- Sorting (ORDER BY) +- Window functions +- DISTINCT, TOP, LIMIT, OFFSET, pagination +- Any $database_name-specific features used$specific_features + +4. **Maintain strict factual accuracy** +- Do NOT infer business meaning unless directly implied. +- Do NOT rename or paraphrase column names; repeat them exactly. + +5. **Use clear, structured natural language** +- Provide a step-by-step explanation that makes every operation and purpose explicit. +- The output must be complete enough that the query can be reconstructed. + +6. **⚠️ CRITICAL: Identify Performance Issues** +Flag any of these CRITICAL performance problems found in the query: +- **NO WHERE CLAUSE** (BE CAREFUL - AVOID FALSE POSITIVES): + * ONLY flag if the MAIN/OUTER SELECT has absolutely NO WHERE keyword with filtering conditions + * If query HAS 'WHERE' followed by conditions (even old-style JOINs in WHERE), DO NOT flag + * Subqueries/EXISTS having WHERE does NOT mean main query has no WHERE + * CROSS APPLY/LATERAL with internal WHERE counts as filtered + * If truly no WHERE: Flag as CRITICAL (causes FULL TABLE SCAN, no predicate pushdown) +- **Non-SARGable patterns**: Functions on indexed columns in WHERE/JOIN (e.g., YEAR(date), UPPER(col)) +- **Leading wildcards**: LIKE '%value%' patterns that prevent index usage +- **Implicit conversions**: Type mismatches in comparisons +- **NOLOCK/WITH (NOLOCK) hints**: If query uses WITH (NOLOCK), WITH (nolock), WITH(NOLOCK), (NOLOCK), (nolock) or NOLOCK/nolock (any case) → DO NOT REMOVE, but FLAG as **CRITICAL RISK**: "⚠️ WITH (NOLOCK) reads uncommitted/dirty data - CRITICAL: may cause INCORRECT FINANCIAL VALUES and data inconsistencies in production" +$analysis_requirements + +Explanation: +""".strip() + _db_path = os.getenv("SQL_OPT_TEAM_DB_FILE", "tmp/sql_optimizer_team.db") _debug_mode = os.getenv("SQL_OPT_TEAM_DEBUG_MODE", "false").strip().lower() in {"1", "true", "yes", "on"} db = SqliteDb(db_file=_db_path) + +async def 
explain_query_tool( + database_type: str, + sql: str, + provider: str | None = None, + model: str | None = None, + temperature: float | None = None, + max_tokens: int | None = None, + api_key: str | None = None, +) -> dict[str, str]: + from sql_optimizer_team.tools.core_tools import explain_query_core + + return await explain_query_core( + database_type=database_type, + sql=sql, + provider=provider, + model=model, + temperature=temperature, + max_tokens=max_tokens, + api_key=api_key, + ) + sql_analyst_agent = Agent( name="SQL Analyst", role=( @@ -20,7 +103,7 @@ sql_analyst_agent = Agent( "A saída deve seguir exatamente a prompt original (SQL → natural) do projeto oracle-sql-query-optimizer." ), model=base_model, - tools=[explain_query_core, load_sql_from_file, ensure_non_empty, supported_databases], + tools=[explain_query_tool, load_sql_from_file, ensure_non_empty, supported_databases], markdown=True, add_history_to_context=True, db=db, @@ -32,67 +115,7 @@ "- Se o usuário fornecer um caminho de arquivo, use load_sql_from_file().", - "- Preferência: use explain_query_core(database_type, sql) para gerar a explicação via core de negócio.", + "- Preferência: use explain_query_tool(database_type, sql) para gerar a explicação via core de negócio.", "- Use a template oficial abaixo para estruturar a explicação (SQL → natural).", - """ - You are an expert $database_name database analyst and performance specialist. Your task is to translate the SQL query below into a detailed, precise natural-language description that another agent will later use to reconstruct and optimize the query. - - $database_name SQL Query: - ```sql - $query - ``` - - Your explanation must follow these requirements: - - 1. **Describe the overall purpose** - - Explain clearly what the query is intended to accomplish and why (retrieve data, update rows, aggregate information, validate existence, create structures, etc.). - - 2.
**List ALL involved database objects** - Explicitly list every: - - Table - - View - - CTE (Common Table Expression) - - Subquery or derived table - - Function - - Stored procedure, if referenced - - Temporary table - - Schema-qualified object - Use the exact names as they appear in the query. - - 3. **Describe all essential operations** - Explicitly state, using exact column names: - - Columns retrieved or modified - - Join types, join conditions, and which objects participate - - Filters and conditions (WHERE, boolean logic, comparisons) - - Aggregations (SUM, COUNT, etc.) - - Grouping and HAVING clauses - - Sorting (ORDER BY) - - Window functions - - DISTINCT, TOP, LIMIT, OFFSET, pagination - - Any $database_name-specific features used$specific_features - - 4. **Maintain strict factual accuracy** - - Do NOT infer business meaning unless directly implied. - - Do NOT rename or paraphrase column names; repeat them exactly. - - 5. **Use clear, structured natural language** - - Provide a step-by-step explanation that makes every operation and purpose explicit. - - The output must be complete enough that the query can be reconstructed. - - 6. 
**⚠️ CRITICAL: Identify Performance Issues** - Flag any of these CRITICAL performance problems found in the query: - - **NO WHERE CLAUSE** (BE CAREFUL - AVOID FALSE POSITIVES): - * ONLY flag if the MAIN/OUTER SELECT has absolutely NO WHERE keyword with filtering conditions - * If query HAS 'WHERE' followed by conditions (even old-style JOINs in WHERE), DO NOT flag - * Subqueries/EXISTS having WHERE does NOT mean main query has no WHERE - * CROSS APPLY/LATERAL with internal WHERE counts as filtered - * If truly no WHERE: Flag as CRITICAL (causes FULL TABLE SCAN, no predicate pushdown) - - **Non-SARGable patterns**: Functions on indexed columns in WHERE/JOIN (e.g., YEAR(date), UPPER(col)) - - **Leading wildcards**: LIKE '%value%' patterns that prevent index usage - - **Implicit conversions**: Type mismatches in comparisons - - **NOLOCK/WITH (NOLOCK) hints**: If query uses WITH (NOLOCK), WITH (nolock), WITH(NOLOCK), (NOLOCK), (nolock) or NOLOCK/nolock (any case) → DO NOT REMOVE, but FLAG as **CRITICAL RISK**: "⚠️ WITH (NOLOCK) reads uncommitted/dirty data - CRITICAL: may cause INCORRECT FINANCIAL VALUES and data inconsistencies in production" - $analysis_requirements - - Explanation: - """.strip(), + SQL_TO_NATURAL_PROMPT, "- Entregue apenas a explicação natural estruturada conforme a prompt; não reescreva a SQL.", "- Identifique problemas críticos de performance conforme a prompt.", ], diff --git a/src/sql_optimizer_team/agents/sql_optimizer_agent.py b/src/sql_optimizer_team/agents/sql_optimizer_agent.py index a3dde66..b6cd22f 100644 --- a/src/sql_optimizer_team/agents/sql_optimizer_agent.py +++ b/src/sql_optimizer_team/agents/sql_optimizer_agent.py @@ -1,18 +1,104 @@ from agno.agent import Agent from agno.db.sqlite import SqliteDb from sql_optimizer_team.tools.engine.model_selector import get_model -from sql_optimizer_team.tools.core_tools import optimize_query_core from sql_optimizer_team.tools.prompt_tools import supported_databases from 
sql_optimizer_team.tools.sql_tools import load_sql_from_file, ensure_non_empty import os base_model = get_model() +NATURAL_TO_SQL_PROMPT = """ +You are an expert $database_name SQL developer and query performance specialist. +Your task is to write an optimized SQL query based exclusively on the natural-language description provided below. + +Description: +$explanation + +⚠️ CRITICAL RULES - READ BEFORE GENERATING SQL: + +1. **PRESERVE ALL BUSINESS LOGIC EXACTLY** +- Every CASE WHEN statement must have IDENTICAL conditions and results +- Every calculated column must use IDENTICAL formulas +- Every subquery must query the SAME tables with SAME filters +- Do NOT simplify, merge, or "improve" business logic - even if it looks redundant +- If description mentions specific conditions (cd_tp_apolice = 2, etc.), preserve them EXACTLY + +2. **PRESERVE ALL TABLES AND COLUMNS** +- Include EVERY table mentioned in the description +- Include EVERY column mentioned in the description +- Use EXACT column names as described (no renaming) +- Use EXACT table aliases as described + +3. **Translate the full described logic into SQL** +- Implement all actions, operations, filters, joins, and conditions exactly as stated. +- Use every object and column referenced in the description, using their exact names. +- If the description mentions specific filter values (e.g., cd_tipo_endosso = 0), use those EXACT values + +4. **Write optimized SQL while preserving semantics** +- Apply $database_name best practices for performance. +- Use indexing-aware filtering, efficient join strategies, and clear expressions. +- Implement aggregations, groupings, window functions, or pagination when described. +- Prefer performant constructs commonly recommended for $database_name workloads. +- OPTIMIZATION means structure/hints/indexes - NOT changing logic + +5. **Use $database_name-specific syntax and features** +- Apply native functions, operators, optimizer behaviors, or hints when appropriate. 
+- Incorporate $specific_requirements if provided. + +6. **Ensure logical fidelity - ZERO TOLERANCE FOR CHANGES** +- The SQL must reflect PRECISELY the behavior described +- Do NOT add logic not explicitly stated +- Do NOT omit any step described +- Do NOT infer or assume details beyond what is explicitly stated +- Do NOT "simplify" complex CASE statements +- Do NOT merge or combine separate calculated columns + +7. **Self-Verification Checklist** (perform before outputting): +- [ ] All tables from description are present in query +- [ ] All columns from description are present in SELECT +- [ ] All CASE conditions match description exactly +- [ ] All subquery filters match description exactly +- [ ] All JOIN conditions match description exactly +- [ ] No business logic was simplified or changed + +8. **Output format** +- Provide ONLY the final, optimized SQL query. +- Do NOT include explanations, comments, or extra text. + +Optimized SQL Query: +""".strip() + _db_path = os.getenv("SQL_OPT_TEAM_DB_FILE", "tmp/sql_optimizer_team.db") _debug_mode = os.getenv("SQL_OPT_TEAM_DEBUG_MODE", "false").strip().lower() in {"1", "true", "yes", "on"} db = SqliteDb(db_file=_db_path) + +async def optimize_query_tool( + database_type: str, + sql: str, + provider: str | None = None, + model: str | None = None, + temperature: float | None = None, + max_tokens: int | None = None, + api_key: str | None = None, + output_dir: str | None = None, + no_review: bool = False, +) -> dict[str, str | dict[str, str]]: + from sql_optimizer_team.tools.core_tools import optimize_query_core + + return await optimize_query_core( + database_type=database_type, + sql=sql, + provider=provider, + model=model, + temperature=temperature, + max_tokens=max_tokens, + api_key=api_key, + output_dir=output_dir, + no_review=no_review, + ) + sql_optimizer_agent = Agent( name="SQL Optimizer", role=( @@ -20,7 +106,7 @@ sql_optimizer_agent = Agent( "mantendo 100% da lógica e entregando apenas a SQL otimizada." 
), model=base_model, - tools=[optimize_query_core, load_sql_from_file, ensure_non_empty, supported_databases], + tools=[optimize_query_tool, load_sql_from_file, ensure_non_empty, supported_databases], markdown=True, add_history_to_context=True, db=db, @@ -31,66 +117,7 @@ "- Exija banco alvo e SQL antes de otimizar.", - "- Use optimize_query_core(database_type, sql) para executar o core de negócio.", + "- Use optimize_query_tool(database_type, sql) para executar o core de negócio.", "- Use a template oficial abaixo para reescrever (natural → SQL) mantendo 100% da lógica.", - """ - You are an expert $database_name SQL developer and query performance specialist. - Your task is to write an optimized SQL query based exclusively on the natural-language description provided below. - - Description: - $explanation - - ⚠️ CRITICAL RULES - READ BEFORE GENERATING SQL: - - 1. **PRESERVE ALL BUSINESS LOGIC EXACTLY** - - Every CASE WHEN statement must have IDENTICAL conditions and results - - Every calculated column must use IDENTICAL formulas - - Every subquery must query the SAME tables with SAME filters - - Do NOT simplify, merge, or "improve" business logic - even if it looks redundant - - If description mentions specific conditions (cd_tp_apolice = 2, etc.), preserve them EXACTLY - - 2. **PRESERVE ALL TABLES AND COLUMNS** - - Include EVERY table mentioned in the description - - Include EVERY column mentioned in the description - - Use EXACT column names as described (no renaming) - - Use EXACT table aliases as described - - 3. **Translate the full described logic into SQL** - - Implement all actions, operations, filters, joins, and conditions exactly as stated. - - Use every object and column referenced in the description, using their exact names. - - If the description mentions specific filter values (e.g., cd_tipo_endosso = 0), use those EXACT values - - 4. **Write optimized SQL while preserving semantics** - - Apply $database_name best practices for performance.
- - Use indexing-aware filtering, efficient join strategies, and clear expressions. - - Implement aggregations, groupings, window functions, or pagination when described. - - Prefer performant constructs commonly recommended for $database_name workloads. - - OPTIMIZATION means structure/hints/indexes - NOT changing logic - - 5. **Use $database_name-specific syntax and features** - - Apply native functions, operators, optimizer behaviors, or hints when appropriate. - - Incorporate $specific_requirements if provided. - - 6. **Ensure logical fidelity - ZERO TOLERANCE FOR CHANGES** - - The SQL must reflect PRECISELY the behavior described - - Do NOT add logic not explicitly stated - - Do NOT omit any step described - - Do NOT infer or assume details beyond what is explicitly stated - - Do NOT "simplify" complex CASE statements - - Do NOT merge or combine separate calculated columns - - 7. **Self-Verification Checklist** (perform before outputting): - - [ ] All tables from description are present in query - - [ ] All columns from description are present in SELECT - - [ ] All CASE conditions match description exactly - - [ ] All subquery filters match description exactly - - [ ] All JOIN conditions match description exactly - - [ ] No business logic was simplified or changed - - 8. **Output format** - - Provide ONLY the final, optimized SQL query. - - Do NOT include explanations, comments, or extra text. 
- - Optimized SQL Query: - """.strip(), + NATURAL_TO_SQL_PROMPT, "- Extraia e devolva SOMENTE optimized_query (sem explicações, sem markdown).", "- Preserve 100% da lógica, colunas, aliases, filtros, joins e subqueries.", ], diff --git a/src/sql_optimizer_team/knowledge/__init__.py b/src/sql_optimizer_team/knowledge/__init__.py new file mode 100644 index 0000000..85192c6 --- /dev/null +++ b/src/sql_optimizer_team/knowledge/__init__.py @@ -0,0 +1,5 @@ +"""Internal knowledge base helpers.""" + +from sql_optimizer_team.knowledge.internal_kb import build_internal_knowledge, attach_internal_knowledge + +__all__ = ["build_internal_knowledge", "attach_internal_knowledge"] diff --git a/src/sql_optimizer_team/knowledge/internal_kb.py b/src/sql_optimizer_team/knowledge/internal_kb.py new file mode 100644 index 0000000..f234e7a --- /dev/null +++ b/src/sql_optimizer_team/knowledge/internal_kb.py @@ -0,0 +1,100 @@ +"""Internal KB (RAG) setup for the SQL optimizer team.""" + +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path +import os + +from agno.db.sqlite import SqliteDb +from agno.knowledge.knowledge import Knowledge +from agno.knowledge.embedder.sentence_transformer import SentenceTransformerEmbedder +from agno.vectordb.chroma import ChromaDb + +from sql_optimizer_team.tools.engine.config.logger import get_logger + +logger = get_logger(__name__) + + +@dataclass(frozen=True) +class InternalKBConfig: + kb_path: Path + chroma_path: Path + embedder_id: str + contents_db_file: Path + block_external: bool + + +def _load_config() -> InternalKBConfig: + kb_path = Path(os.getenv("SQL_OPT_KB_PATH", "kb")).resolve() + chroma_path = Path(os.getenv("SQL_OPT_KB_CHROMA_PATH", "tmp/kb_chroma")).resolve() + embedder_id = os.getenv( + "SQL_OPT_KB_EMBEDDER_ID", + "sentence-transformers/all-MiniLM-L6-v2", + ).strip() + contents_db_file = Path(os.getenv("SQL_OPT_KB_DB_FILE", "tmp/sql_optimizer_kb.db")).resolve() + block_external = 
os.getenv("SQL_OPT_BLOCK_EXTERNAL_TOOLS", "true").strip().lower() in {"1", "true", "yes", "on"} + return InternalKBConfig( + kb_path=kb_path, + chroma_path=chroma_path, + embedder_id=embedder_id, + contents_db_file=contents_db_file, + block_external=block_external, + ) + + +def build_internal_knowledge() -> Knowledge: + config = _load_config() + + if config.block_external: + logger.info("External tools blocked for KB", kb_path=str(config.kb_path)) + + embedder = SentenceTransformerEmbedder(id=config.embedder_id) + vector_db = ChromaDb( + name="sql-optimizer-kb", + path=str(config.chroma_path), + persistent_client=True, + embedder=embedder, + ) + contents_db = SqliteDb(db_file=str(config.contents_db_file)) + + knowledge = Knowledge( + name="internal-sql-kb", + description="Base de conhecimento interna para otimização de SQL", + vector_db=vector_db, + contents_db=contents_db, + max_results=6, + ) + + if not config.kb_path.exists(): + logger.warning("KB path not found; skipping ingest", kb_path=str(config.kb_path)) + return knowledge + + if config.block_external and not config.kb_path.is_dir(): + logger.warning("KB path is not a directory; skipping ingest", kb_path=str(config.kb_path)) + return knowledge + + try: + knowledge.insert( + path=str(config.kb_path), + include=["**/*.md", "**/*.txt", "**/*.sql", "**/*.pdf"], + exclude=["**/.git/**", "**/.venv/**", "**/__pycache__/**"], + upsert=True, + skip_if_exists=True, + ) + logger.info("KB ingest complete", kb_path=str(config.kb_path)) + except Exception as exc: + logger.error("KB ingest failed", error=str(exc)) + + return knowledge + + +def attach_internal_knowledge(knowledge: Knowledge, *agents: object) -> None: + for agent in agents: + try: + setattr(agent, "knowledge", knowledge) + setattr(agent, "add_knowledge_to_context", True) + setattr(agent, "search_knowledge", True) + setattr(agent, "update_knowledge", False) + except Exception as exc: + logger.warning("Failed to attach knowledge", agent=str(agent), 
error=str(exc))
diff --git a/src/sql_optimizer_team/team_app.py b/src/sql_optimizer_team/team_app.py
index 966757f..94c9ece 100644
--- a/src/sql_optimizer_team/team_app.py
+++ b/src/sql_optimizer_team/team_app.py
@@ -2,13 +2,12 @@
 from agno.team.team import Team
 from agno.os.app import AgentOS
 from agno.db.sqlite import SqliteDb
 from dotenv import load_dotenv
+from sql_optimizer_team.knowledge import build_internal_knowledge, attach_internal_knowledge
 from sql_optimizer_team.tools.engine.model_selector import get_model
-from sql_optimizer_team.agents import (
-    sql_analyst_agent,
-    sql_optimizer_agent,
-    sql_quality_agent,
-    conservative_analysis_agent,
-)
+from sql_optimizer_team.agents.sql_analyst_agent import sql_analyst_agent
+from sql_optimizer_team.agents.sql_optimizer_agent import sql_optimizer_agent
+from sql_optimizer_team.agents.sql_quality_agent import sql_quality_agent
+from sql_optimizer_team.agents.conservative_analysis_agent import conservative_analysis_agent
 import os
 
 load_dotenv()
@@ -20,6 +19,39 @@ _debug_mode = os.getenv("SQL_OPT_TEAM_DEBUG_MODE", "false").strip().lower() in {
 
 db = SqliteDb(db_file=_db_path)
 
+_kb = build_internal_knowledge()
+attach_internal_knowledge(
+    _kb,
+    sql_analyst_agent,
+    sql_optimizer_agent,
+    sql_quality_agent,
+    conservative_analysis_agent,
+)
+
+_block_external_tools = os.getenv("SQL_OPT_BLOCK_EXTERNAL_TOOLS", "true").strip().lower() in {"1", "true", "yes", "on"}
+if _block_external_tools:
+    _allowed_tool_names = {
+        "explain_query_tool",
+        "optimize_query_tool",
+        "load_sql_from_file",
+        "ensure_non_empty",
+        "supported_databases",
+        "diff_sql",
+    }
+
+    def _filter_tools(agent) -> None:
+        if not getattr(agent, "tools", None):
+            return
+        filtered = []
+        for tool in agent.tools:
+            name = getattr(tool, "__name__", None) or getattr(tool, "name", None) or str(tool)
+            if name in _allowed_tool_names:
+                filtered.append(tool)
+        agent.tools = filtered
+
+    for _agent in [sql_analyst_agent, sql_optimizer_agent, sql_quality_agent, conservative_analysis_agent]:
+        _filter_tools(_agent)
+
 sql_optimizer_team = Team(
     name="SQL Optimization Team",
     model=base_model,
diff --git a/src/sql_optimizer_team/tools/engine/llm_tools/agno_tool.py b/src/sql_optimizer_team/tools/engine/llm_tools/agno_tool.py
index 98212d5..0a0b994 100644
--- a/src/sql_optimizer_team/tools/engine/llm_tools/agno_tool.py
+++ b/src/sql_optimizer_team/tools/engine/llm_tools/agno_tool.py
@@ -41,6 +41,7 @@ class AgnoLLMTool(BaseLLMTool):
             result_text = self._extract_text(response)
             validated = self._validate_response(result_text)
             self._log_response(validated)
+            self._log_usage_from_response(response, prompt, validated)
             return validated
         except Exception as e:
             self._log_error(e)
diff --git a/src/sql_optimizer_team/tools/engine/llm_tools/base_tool.py b/src/sql_optimizer_team/tools/engine/llm_tools/base_tool.py
index 748a975..d1c998f 100644
--- a/src/sql_optimizer_team/tools/engine/llm_tools/base_tool.py
+++ b/src/sql_optimizer_team/tools/engine/llm_tools/base_tool.py
@@ -5,6 +5,8 @@ This module provides a base class with common functionality for all LLM tools.
 
 from abc import ABC
 from typing import Any
+import math
+import os
 
 from sql_optimizer_team.tools.engine.tools_api.llm_tool import LLMTool
 from sql_optimizer_team.tools.engine.types.tool_exceptions import LLMProviderError
@@ -91,6 +93,49 @@ class BaseLLMTool(LLMTool, ABC):
             **kwargs,
         )
 
+    def _estimate_tokens(self, text: str) -> int:
+        """Best-effort token estimate when provider usage is unavailable."""
+        if not text:
+            return 0
+        return max(1, math.ceil(len(text) / 4))
+
+    def _log_usage_from_response(self, response_obj: Any, prompt: str, response_text: str) -> None:
+        """Log token usage and cost if enabled.
+
+        Reads usage from ModelResponse when available, otherwise uses a rough estimate.
+        Cost is computed using env vars LLM_COST_INPUT_PER_1K and LLM_COST_OUTPUT_PER_1K.
+        """
+        enabled = os.getenv("LLM_LOG_USAGE", "true").strip().lower() in {"1", "true", "yes", "on"}
+        if not enabled:
+            return
+
+        input_tokens = getattr(response_obj, "input_tokens", None)
+        output_tokens = getattr(response_obj, "output_tokens", None)
+        total_tokens = getattr(response_obj, "total_tokens", None)
+
+        if input_tokens is None:
+            input_tokens = self._estimate_tokens(prompt)
+        if output_tokens is None:
+            output_tokens = self._estimate_tokens(response_text)
+        if total_tokens is None and input_tokens is not None and output_tokens is not None:
+            total_tokens = input_tokens + output_tokens
+
+        cost_in = float(os.getenv("LLM_COST_INPUT_PER_1K", "0") or 0)
+        cost_out = float(os.getenv("LLM_COST_OUTPUT_PER_1K", "0") or 0)
+        cost_usd = None
+        if input_tokens is not None or output_tokens is not None:
+            cost_usd = (input_tokens or 0) * cost_in / 1000 + (output_tokens or 0) * cost_out / 1000
+
+        logger.info(
+            "LLM usage",
+            provider=self.provider_name,
+            model=self._model_name,
+            input_tokens=input_tokens,
+            output_tokens=output_tokens,
+            total_tokens=total_tokens,
+            cost_usd=cost_usd,
+        )
+
     def _log_error(self, error: Exception, **kwargs: Any) -> None:
         """Log LLM error.
diff --git a/src/sql_optimizer_team/tools/engine/prompt_tools/base_generator.py b/src/sql_optimizer_team/tools/engine/prompt_tools/base_generator.py
index 46ef151..e18296f 100644
--- a/src/sql_optimizer_team/tools/engine/prompt_tools/base_generator.py
+++ b/src/sql_optimizer_team/tools/engine/prompt_tools/base_generator.py
@@ -7,185 +7,17 @@ reducing code duplication and ensuring consistency.
 
 from abc import ABC, abstractmethod
 from string import Template
+import importlib
 
 from sql_optimizer_team.tools.engine.tools_api.prompt_tool import PromptGeneratorTool
 
-SQL_TO_NATURAL_TEMPLATE = Template("""
-    You are an expert $database_name database analyst and performance specialist.
-    Your task is to translate the SQL query below into a detailed, precise natural-language description that another agent will later use to reconstruct and optimize the query.
-
-    $database_name SQL Query:
-    ```sql
-    $query
-    ```
-
-    Your explanation must follow these requirements:
-
-    1. **Describe the overall purpose**
-       - Explain clearly what the query is intended to accomplish and why (retrieve data, update rows, aggregate information, validate existence, create structures, etc.).
-
-    2. **List ALL involved database objects**
-       Explicitly list every:
-       - Table
-       - View
-       - CTE (Common Table Expression)
-       - Subquery or derived table
-       - Function
-       - Stored procedure, if referenced
-       - Temporary table
-       - Schema-qualified object
-       Use the exact names as they appear in the query.
-
-    3. **Describe all essential operations**
-       Explicitly state, using exact column names:
-       - Columns retrieved or modified
-       - Join types, join conditions, and which objects participate
-       - Filters and conditions (WHERE, boolean logic, comparisons)
-       - Aggregations (SUM, COUNT, etc.)
-       - Grouping and HAVING clauses
-       - Sorting (ORDER BY)
-       - Window functions
-       - DISTINCT, TOP, LIMIT, OFFSET, pagination
-       - Any $database_name-specific features used$specific_features
-
-    4. **Maintain strict factual accuracy**
-       - Do NOT infer business meaning unless directly implied.
-       - Do NOT rename or paraphrase column names; repeat them exactly.
-
-    5. **Use clear, structured natural language**
-       - Provide a step-by-step explanation that makes every operation and purpose explicit.
-       - The output must be complete enough that the query can be reconstructed.
-
-    6. **⚠️ CRITICAL: Identify Performance Issues**
-       Flag any of these CRITICAL performance problems found in the query:
-       - **NO WHERE CLAUSE** (BE CAREFUL - AVOID FALSE POSITIVES):
-         * ONLY flag if the MAIN/OUTER SELECT has absolutely NO WHERE keyword with filtering conditions
-         * If query HAS 'WHERE' followed by conditions (even old-style JOINs in WHERE), DO NOT flag
-         * Subqueries/EXISTS having WHERE does NOT mean main query has no WHERE
-         * CROSS APPLY/LATERAL with internal WHERE counts as filtered
-         * If truly no WHERE: Flag as CRITICAL (causes FULL TABLE SCAN, no predicate pushdown)
-       - **Non-SARGable patterns**: Functions on indexed columns in WHERE/JOIN (e.g., YEAR(date), UPPER(col))
-       - **Leading wildcards**: LIKE '%value%' patterns that prevent index usage
-       - **Implicit conversions**: Type mismatches in comparisons
-       - **NOLOCK/WITH (NOLOCK) hints**: If query uses WITH (NOLOCK), WITH (nolock), WITH(NOLOCK), (NOLOCK), (nolock) or NOLOCK/nolock (any case) → DO NOT REMOVE, but FLAG as **CRITICAL RISK**: "⚠️ WITH (NOLOCK) reads uncommitted/dirty data - CRITICAL: may cause INCORRECT FINANCIAL VALUES and data inconsistencies in production"
-    $analysis_requirements
-
-    Explanation:
-""")
-
-NATURAL_TO_SQL_TEMPLATE = Template("""
-    You are an expert $database_name SQL developer and query performance specialist.
-    Your task is to write an optimized SQL query based exclusively on the natural-language description provided below.
-
-    Description:
-    $explanation
-
-    ⚠️ CRITICAL RULES - READ BEFORE GENERATING SQL:
-
-    1. **PRESERVE ALL BUSINESS LOGIC EXACTLY**
-       - Every CASE WHEN statement must have IDENTICAL conditions and results
-       - Every calculated column must use IDENTICAL formulas
-       - Every subquery must query the SAME tables with SAME filters
-       - Do NOT simplify, merge, or "improve" business logic - even if it looks redundant
-       - If description mentions specific conditions (cd_tp_apolice = 2, etc.), preserve them EXACTLY
-
-    2. **PRESERVE ALL TABLES AND COLUMNS**
-       - Include EVERY table mentioned in the description
-       - Include EVERY column mentioned in the description
-       - Use EXACT column names as described (no renaming)
-       - Use EXACT table aliases as described
-
-    3. **Translate the full described logic into SQL**
-       - Implement all actions, operations, filters, joins, and conditions exactly as stated.
-       - Use every object and column referenced in the description, using their exact names.
-       - If the description mentions specific filter values (e.g., cd_tipo_endosso = 0), use those EXACT values
-
-    4. **Write optimized SQL while preserving semantics**
-       - Apply $database_name best practices for performance.
-       - Use indexing-aware filtering, efficient join strategies, and clear expressions.
-       - Implement aggregations, groupings, window functions, or pagination when described.
-       - Prefer performant constructs commonly recommended for $database_name workloads.
-       - OPTIMIZATION means structure/hints/indexes - NOT changing logic
-
-    5. **Use $database_name-specific syntax and features**
-       - Apply native functions, operators, optimizer behaviors, or hints when appropriate.
-       - Incorporate $specific_requirements if provided.
-
-    6. **Ensure logical fidelity - ZERO TOLERANCE FOR CHANGES**
-       - The SQL must reflect PRECISELY the behavior described
-       - Do NOT add logic not explicitly stated
-       - Do NOT omit any step described
-       - Do NOT infer or assume details beyond what is explicitly stated
-       - Do NOT "simplify" complex CASE statements
-       - Do NOT merge or combine separate calculated columns
-
-    7. **Self-Verification Checklist** (perform before outputting):
-       - [ ] All tables from description are present in query
-       - [ ] All columns from description are present in SELECT
-       - [ ] All CASE conditions match description exactly
-       - [ ] All subquery filters match description exactly
-       - [ ] All JOIN conditions match description exactly
-       - [ ] No business logic was simplified or changed
-
-    8. **Output format**
-       - Provide ONLY the final, optimized SQL query.
-       - Do NOT include explanations, comments, or extra text.
-
-    Optimized SQL Query:
-""")
-
-CONSERVATIVE_ANALYSIS_TEMPLATE = Template("""
-    You are an expert $database_name database analyst and performance specialist.
-
-    Your task is to ANALYZE the SQL query below and provide SUGGESTIONS for improvement.
-
-    ⚠️ CRITICAL: You must NOT rewrite or modify the query. Only provide analysis and suggestions.
-
-    $database_name SQL Query:
-    ```sql
-    $query
-    ```
-
-    Query Complexity Information:
-    - Columns: $column_count
-    - Tables: $table_count
-    - Subqueries: $subquery_count
-    - CASE statements: $case_count
-    - JOINs: $join_count
-    - Complexity Level: $complexity_level
-
-    Provide your analysis in the following structured format:
-
-    ## PERFORMANCE ISSUES
-    List each performance issue found, with severity (CRITICAL/HIGH/MEDIUM/LOW):
-    - [SEVERITY] Issue description
-    - [SEVERITY] Issue description
-
-    ## SUGGESTED INDEXES
-    List indexes that could improve this query:
-    - CREATE INDEX idx_name ON table(columns) -- Reason
-
-    ## OPTIMIZATION SUGGESTIONS
-    List specific suggestions WITHOUT rewriting the query:
-    - Suggestion 1: Description of what could be improved and why
-    - Suggestion 2: Description of what could be improved and why
-
-    ## RISK ASSESSMENT
-    - WITH (NOLOCK) usage: [Yes/No] - If yes, explain the risks
-    - Missing WHERE clause: [Yes/No] - If yes, explain the impact
-    - Implicit conversions: [Yes/No] - If yes, list them
-
-    ## SUMMARY
-    Brief summary of the most important findings and priority order for addressing them.
-
-    Remember: DO NOT provide a rewritten query. Only analysis and suggestions.
-""")
-
-
 def _render_sql_to_natural(
     database_name: str, query: str, specific_features: str = "", analysis_requirements: str = ""
 ) -> str:
-    return SQL_TO_NATURAL_TEMPLATE.substitute(
+    module = importlib.import_module("sql_optimizer_team.agents.sql_analyst_agent")
+    template_text = getattr(module, "SQL_TO_NATURAL_PROMPT")
+    return Template(template_text).substitute(
         database_name=database_name,
         query=query,
         specific_features=f"\n{specific_features}" if specific_features else "",
@@ -196,7 +28,9 @@ def _render_sql_to_natural(
 def _render_natural_to_sql(
     database_name: str, explanation: str, specific_requirements: str
 ) -> str:
-    return NATURAL_TO_SQL_TEMPLATE.substitute(
+    module = importlib.import_module("sql_optimizer_team.agents.sql_optimizer_agent")
+    template_text = getattr(module, "NATURAL_TO_SQL_PROMPT")
+    return Template(template_text).substitute(
         database_name=database_name,
         explanation=explanation,
         specific_requirements="\n".join(
@@ -215,7 +49,9 @@ def _render_conservative_analysis(
     join_count: int = 0,
     complexity_level: str = "unknown",
 ) -> str:
-    return CONSERVATIVE_ANALYSIS_TEMPLATE.substitute(
+    module = importlib.import_module("sql_optimizer_team.agents.conservative_analysis_agent")
+    template_text = getattr(module, "CONSERVATIVE_ANALYSIS_PROMPT")
+    return Template(template_text).substitute(
         database_name=database_name,
         query=query,
         column_count=column_count,