New method lets DeepSeek and other models answer ‘sensitive’ questions

by | Apr 17, 2025 | Technology


It is tough to remove bias, and in some cases outright censorship, from large language models (LLMs). One such model, DeepSeek from China, has alarmed politicians and some business leaders over its potential danger to national security.

A select committee in the U.S. Congress recently released a report calling DeepSeek “a profound threat to our nation’s security” and detailing policy recommendations.

While there are ways to bypass bias through Reinforcement Learning from Human Feedback (RLHF) and fine-tuning, the enterprise risk management startup CTGT claims to have an alternative approach. CTGT developed a method to bypass the bias and censorship baked into some language models, which the company says removes 100% of censorship.

In a paper, Cyril Gorlla and Trevor Tuttle of CTGT said that their framework “directly locates and modifies the internal features responsible for censorship.”

“This approach is not only computationally efficient but also allows fine-grained control over model behavior, ensuring that uncensored responses are delivered without compromising the model’s overall capabilities and factual accuracy,” the paper said. 

While the method was developed explicitly with DeepSeek-R1-Distill-Llama-70B in mind, the same process can be used on other models. 

“We have tested CTGT with other open weights models such as Llama and found it to be just as effective,” Gorlla told VentureBeat in an email. “Our technology operates at the foundational neural network level, meaning it applies to all deep learning models. We’re working with a leading foundation model lab to ensure their new models are trustworthy and safe from the core.”

How it works

The researchers said their method identifies features with a high likelihood of being associated with unwanted behaviors. 

“The key idea is that within a large language model, there exist latent variables (neurons or directions in the hidden state) that correspond to concepts like ‘censorship trigger’ or ‘toxic sentiment’. If we can find those variables, we can directly manipulate them,” Gorlla and Tuttle wrote. 
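As a rough illustration of what "directly manipulating" a direction in the hidden state could look like, the sketch below removes the component of a hidden-state vector that lies along a given direction. This is a generic projection technique, assumed here for illustration only; it is not CTGT's published procedure. The function name `ablate_direction`, the `strength` parameter, and the toy vectors are all hypothetical.

```python
import math

def ablate_direction(hidden, direction, strength=1.0):
    """Subtract the component of `hidden` that lies along `direction`.

    strength=1.0 removes the component entirely; smaller values only dampen it.
    Both arguments are plain lists of floats of the same length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - strength * coeff * u for h, u in zip(hidden, unit)]

# Toy example: a 3-dimensional "hidden state" and a hypothetical feature direction.
hidden = [2.0, 1.0, -1.0]
direction = [1.0, 0.0, 0.0]
edited = ablate_direction(hidden, direction)
dampened = ablate_direction(hidden, direction, strength=0.5)
```

In this toy case the first coordinate plays the role of the unwanted feature, so full ablation zeroes it and leaves the other coordinates unchanged. A real model's hidden states have thousands of dimensions, and the direction would have to be estimated from data.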

CTGT said there are three key steps:

Feature identification

Feature isolation and characterization

Dynamic feature modification

The researchers write a series of prompts that could trigger one of those “toxic sentiments.” For example, they may ask for more information about Tiananmen Square or request tips to bypass firewalls. They run the prompts and, based on the responses, establish a pattern and find the vectors where the model decides to censor information.
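The article does not say how those vectors are computed. One common approach in interpretability work, assumed here purely for illustration, is to compare average activations on prompts that trigger the behavior against average activations on neutral prompts. The helper names and the toy activation values below are hypothetical; in practice the activations would be read from a model's hidden layers.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def candidate_direction(sensitive_acts, neutral_acts):
    """Difference of mean activations: sensitive prompts minus neutral prompts."""
    mu_sensitive = mean_vector(sensitive_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [s - n for s, n in zip(mu_sensitive, mu_neutral)]

# Made-up 2-dimensional activations, one row per prompt.
sensitive_acts = [[3.0, 1.0], [5.0, 1.0]]  # prompts that triggered a refusal
neutral_acts = [[1.0, 1.0], [1.0, 1.0]]    # prompts answered normally
direction = candidate_direction(sensitive_acts, neutral_acts)
```

Here the two groups differ only in the first coordinate, so the resulting vector points along it. With real activations the signal would be spread across many dimensions and would need validation before being treated as a "censorship" feature.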

Once these are identified, the researchers can isolate that feature and figure out which part of the unwanted behavior it controls. Behavior may include responding more cautiously or refusing to respond altogethe …
