ai-alignment

Here are 37 public repositories matching this topic...

MinghuiChen43 / awesome-trustworthy-deep-learning

A curated list of papers on trustworthy deep learning, updated daily.

Updated Dec 20, 2024

agencyenterprise / PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Feb 26, 2024
Python

tomekkorbak / pretraining-with-human-feedback

Code accompanying the paper Pretraining Language Models with Human Preferences

reinforcement-learning gpt language-models ai-safety ai-alignment pretraining decision-transformers rlhf

Updated Feb 13, 2024
Python

lets-make-safe-ai / make-safe-ai

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚

ai agi artificial-intelligence artificial-general-intelligence ai-safety ai-alignment

Updated Mar 29, 2023

Giskard-AI / awesome-ai-safety

📚 A curated list of papers & technical articles on AI Quality & Safety

Updated Oct 13, 2023

tsinghua-fib-lab / AAAI2025_MIA-Tuner

[AAAI'25 Oral] "MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector".

ai-alignment membership-inference-attack large-language-models pretraining-data-detection

Updated Dec 16, 2024
Python

EzgiKorkmaz / adversarial-reinforcement-learning

Reading list for adversarial perspective and robustness in deep reinforcement learning.

deep-reinforcement-learning ai-safety adversarial-machine-learning multiagent-reinforcement-learning robust-machine-learning ai-alignment safe-reinforcement-learning robust-reinforcement-learning responsible-ai adversarial-reinforcement-learning meta-reinforcement-learning explainable-machine-learning adversarial-policies safe-rlhf machine-learning-safety reinforcement-learning-safety artificial-intelligence-alignment reinforcement-learning-alignment robust-deep-reinforcement-learning

Updated Jun 18, 2024

dit7ya / awesome-ai-alignment

A curated list of awesome resources for Artificial Intelligence Alignment research

awesome awesome-list ai-safety ai-alignment

Updated Jul 14, 2023

AthenaCore / AwesomeResponsibleAI

A curated list of awesome academic research, books, code of ethics, data sets, institutes, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy, and Human-Centered AI.

ai awesome-list ai-safety interpretable-ai explainable-ai xai ai-alignment fairness-ai responsible-ai ethical-ai trustworthy-ai ai-regulation ai-governance ai-standards

Updated Dec 23, 2024

RLHFlow / Directional-Preference-Alignment

Directional Preference Alignment

ai-alignment large-language-models rlhf

Updated Sep 23, 2024

wesg52 / sparse-probing-paper

Sparse probing paper full code.

ai-safety interpretability ai-alignment mechanistic-interpretability

Updated Dec 17, 2023
Jupyter Notebook

riceissa / aiwatch

Website to track people, organizations, and products (tools, websites, etc.) in AI safety

mysql php database dataset ai-safety data-portal aisafety ai-alignment

Updated Dec 23, 2024
HTML

UCSC-VLAA / Sight-Beyond-Text

[TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"

alignment vlm ai-alignment vision-language vicuna llm mllm llava llama2

Updated Sep 15, 2023
Python

liondw / Signal-Alignment

An initiative to create concise and widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort. Boosting the signal and moving the community towards finding and building solutions.

education design ai ai-alignment

Updated Jul 9, 2023

phelps-sg / llm-cooperation

Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023

economics ai-safety gametheory experimental-economics behavioral-economics prisoners-dilemma ai-alignment experimental-psychology social-dilemmas gpt-3 gpt-4 llm principal-agent-problem

Updated Dec 11, 2024
Python

IQTLabs / daisybell

Scan your AI/ML models for problems before you put them into production.

cybersecurity ai-safety bias-correction bias-detection ai-alignment model-poison ai-assurance

Updated Dec 19, 2024
Python

ai-fail-safe / safe-reward

A prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation

failsafe ai-safety anomaly-detection ai-alignment fail-safe

Updated Nov 8, 2022
Python

rmoehn / farlamp

IDA with RL and overseer failures

ida research-project ai-alignment

Updated Jul 31, 2021
TeX

rmoehn / amplification

An implementation of iterated distillation and amplification

transformer ida supervised-learning ai-safety ai-alignment

Updated Jun 22, 2022
Python

lzzcd001 / nabla-gfn

Official Implementation of Nabla-GFlowNet

generative-model ai-alignment diffusion-models gflownet

Updated Dec 9, 2024

