Keeping secrets out of logs (2024)

The Hacker News discussion centers on how sensitive data, particularly secrets such as passwords and API keys, ends up in logs by accident, and on the strategies and difficulties involved in preventing and mitigating this.

The Pervasive Nature of Secrets in Logs

A central theme is how common it is for sensitive information to leak into logs, often in unexpected ways. Users shared scenarios showing how easily this happens, even to well-intentioned developers.

"A lot of these problems come from architectures where secrets go over the wire instead of just using signatures/ids."
"But Brian's phone number stashed in an innocuous case metadata field. Gaah!"
"New hires out of college/bootcamps often have no awareness of the risks here at all. Sometimes even engineers with years of experience but no operational mentorship in their career."
"The kitchen sink example in particular is one that trips up people. Without knowing the specifics of how a library may deal with failure edge cases, it can catch you off guard (e.g., axios errors including API key headers)."
"Stack traces, too. I did some work with a heavy Java shop and pretty much everything sensitive ended up in a stack trace at some point."
"I think the big problem is when secrets can be anywhere in a string and you don't control the input (e.g, library stacktraces, HTTP responses, JSON that was stringified)."
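The axios case mentioned above can be illustrated with a minimal sketch. It assumes an error object that carries the original request config, including headers, as the commenter describes; `summarizeError`, the simulated error, and the URL are hypothetical and only stand in for a real failure.

```javascript
// Hypothetical helper: build a log-safe summary of an HTTP client error.
// Assumption: the error may carry the request config, whose headers can
// include an Authorization value or API key.
function summarizeError(err) {
  return {
    name: err.name,
    message: err.message,
    // Copy only fields believed to be safe; never log the whole error object.
    method: err.config?.method,
    url: err.config?.url,
    status: err.response?.status,
  };
}

// Simulated error for illustration (not produced by a real request).
const fakeError = Object.assign(new Error("Request failed with status code 401"), {
  config: {
    method: "get",
    url: "https://api.example.com/v1/items",
    headers: { Authorization: "Bearer sk_test_123" },
  },
  response: { status: 401 },
});

// Serializing the raw error would include the Authorization header;
// the summary does not.
const safe = summarizeError(fakeError);
console.log(JSON.stringify(safe));
```

Even this summary is not automatically safe: a URL can carry a token in its query string, so the fields to keep still need review per call site.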

The Difficulty of Universal Secret Detection and Redaction

Commenters discuss technical solutions for keeping secrets out of logs, but also highlight the complexity and limits of these methods.

"That presumes you know all secrets ahead of time. A risk in and of itself. But from a practical point of view you will never know all secrets, because they are generated constantly in real time."
"Regexes are fine for SSNs and the like, but to be really effective, one would need a full-on Named Entity Recognition in the pipeline, perhaps just as a canary."
"And an exact match is just part of the question; if a dev redacts the end and another dev redacts the start, you can still reassemble the secret with enough logs."
"blkhawk: oh god – I had that come up in an issue at work just about a month ago. A development system used really simple usernames and passwords since it was just for testing but all the lines with one of those got gobbled up because they had 'secrets' in them."
"Especially since the ability of lines getting censored even when the secrets were just part of words showed that probably no hashing was involved."
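Two of the failure modes quoted above can be reproduced in a few lines. This is a sketch under assumptions: `redactNaive` is a hypothetical function built on the same exact-match `replaceAll` idea that appears elsewhere in the thread, and the secret values are made up.

```javascript
// Naive redaction: replace every occurrence of each known secret.
function redactNaive(message, secrets) {
  let out = message;
  for (const secret of secrets) {
    out = out.replaceAll(secret, "**");
  }
  return out;
}

// Failure mode 1: a weak test password that also appears inside ordinary words.
const secrets = ["test"];
console.log(redactNaive("login ok, password=test", secrets));
// → "login ok, password=**"
console.log(redactNaive("running latest integration tests", secrets));
// → "running la** integration **s"

// Failure mode 2: two partial redactions of the same secret can be recombined.
const apiKey = "abcd1234efgh5678"; // made-up value
const logA = "key=" + apiKey.slice(0, 8) + "********"; // one dev masks the end
const logB = "key=" + "********" + apiKey.slice(8);    // another masks the start
const rebuilt = logA.slice(4, 12) + logB.slice(12);
console.log(rebuilt === apiKey); // → true
```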

Layered Security and Defense in Depth Strategies

A significant portion of the discussion stresses that no single solution is foolproof and that a multi-layered approach, often called "defense in depth," is crucial.

"My argument is that generally everyone has access to all the logs. If you restrict the access and add guardrails around it, you can minimize the surface area and also ways it can be leaked out."
"If you take a defensive approach towards, you have to assume that some secret is getting logged somewhere. The goal then becomes a way to reduce the surface area or blast radius of this possible leakage."
"Yes, but think defense in depth. Your team member who leaves for a competitor could tell them your peak usage hours, but he shouldn't be able to tell them all your customers' passwords."
"You could have 100s of people who have a business need to look at syslog from a router, but approximately nobody who should have access to login creds of administrative users and maybe 10s of people with access to automation role account creds."
"Dataflow analysis and control applies in a BIG way, e.g., separating an audit log for forensics, where you really NEED the PII, from a technical log which the SREs can dig into without being suspected of stealing sensitive info. Start there."
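The audit-log versus technical-log split from the last quote could be sketched as follows. This is only an illustration: the field names in `SENSITIVE_FIELDS`, the `logEvent` function, and the in-memory arrays standing in for two separately access-controlled stores are all assumptions.

```javascript
// Hypothetical set of field names treated as sensitive.
const SENSITIVE_FIELDS = new Set(["email", "phone", "ssn"]);

// In-memory stand-ins for two log stores with different access controls.
const technicalLog = []; // broad access (SREs, on-call)
const auditLog = [];     // restricted access (forensics)

function logEvent(event, fields) {
  const technical = {};
  const sensitive = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_FIELDS.has(key)) {
      sensitive[key] = value;
    } else {
      technical[key] = value;
    }
  }
  technicalLog.push({ event, ...technical });
  if (Object.keys(sensitive).length > 0) {
    // Keep the request id so the two records can be correlated later.
    auditLog.push({ event, requestId: fields.requestId, ...sensitive });
  }
}

logEvent("case_updated", { requestId: "r-1", durationMs: 42, phone: "555-0100" });
console.log(JSON.stringify(technicalLog[0]));
console.log(JSON.stringify(auditLog[0]));
```

Routing by field name only catches sensitive data that arrives in the expected field. It would not catch a phone number stashed in a free-text metadata field, which is why commenters treat this as one layer among several.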

The "Who Should See What" Problem (Access Control for Logs)

Beyond preventing secrets from entering logs, commenters identify controlling who can access which logs as a critical security measure.

"I think secrets ending up in the log is an issue but who should have access to view logs of what log should also be an important that is often ignored. This is also scope down the surface area of leakage."
"Logs probably need to be exposed to support teams, oncalls for sister-teams (if you are a large org), all your devs etc. That is many MANY more people than need access to secrets."
"Secrets in logs therefore puts you are much wider risk of internal threats and makes it MUCH easier for an attacker who phishes someone to pivot to higher credentials."
"Also if you have audit records, you want accessing a secret to be logged separately from accessing logs."

Questioning the Philosophy of "Log Everything"

The practice of logging excessively ("logging everything") is questioned due to its cost and potential security implications.

"blkhawk: ... why are you logging everything you lazy asses and adding all the secrets into another tool just to scan for them in logs just adds another point for them to leak..."
"pavel_lishin: Why is logging everything considered lazy?"
"tonymet: for one it's extremely costly, in vcpu, storage, transfer rates. and if you're paying a third-party logger, multiply each by 10x"
"tonymet: the lazy part comes from the fact that it's easier to be foolish in this case than to be selective about what gets logged. So lazy & foolish."
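One way to be selective, mentioned elsewhere in the thread, is to send only warn-and-above entries upstream and keep lower levels on the node. The sketch below assumes that approach; the level names, numeric severities, and `createLogger` are hypothetical, and arrays stand in for the local and upstream destinations.

```javascript
// Hypothetical numeric severities for common level names.
const LEVELS = new Map([
  ["debug", 10],
  ["info", 20],
  ["warn", 30],
  ["error", 40],
]);

function createLogger(upstreamThreshold = "warn") {
  const local = [];    // stays on the node
  const upstream = []; // would be shipped to a central or third-party logger

  function log(level, message) {
    if (!LEVELS.has(level)) throw new Error(`unknown level: ${level}`);
    const entry = { level, message };
    local.push(entry);
    // Forward only entries at or above the threshold.
    if (LEVELS.get(level) >= LEVELS.get(upstreamThreshold)) {
      upstream.push(entry);
    }
  }

  return { log, local, upstream };
}

const logger = createLogger();
logger.log("debug", "cache miss for key user:42");
logger.log("error", "payment provider timed out");
console.log(logger.local.length, logger.upstream.length); // → 2 1
```

A level threshold reduces volume and exposure but does not by itself keep secrets out of the entries that are forwarded.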

Potential Technical Solutions and Tools

Several specific technical approaches and tools are mentioned as potential aids in managing secrets in logs.

"With java there's a GuardedString implementation https://docs.oracle.com/en/middleware/idm/identity-governance..."
"secrets.forEach(secret => logMessage = logMessage.replaceAll(secret, '**'))"
"As far as run-time exposure prevention goes, I feel like in-band signaling might work better than out-of-band for this problem. Along the lines of the taint checking technique mentioned, you can insert some magic string (say, some recognizable prefix + a randomly generated UUID) into your sensitive strings at the source, that you then strip out at the sink. (Or wrap your secrets in a pair of such magic strings.) Then block or mask any strings containing that magic string from making it into any persisted data, including logs. And it will be easy to identify the points of exposure, since they will be wherever you call your respective seal()/unseal() function or such."
"One direction to venture would be running rsyslog on every node, using regex to match all the known patterns and use various plugins/addons to send all the applications to the local rsyslog instance... Rsyslog supports using a spooler so that if the up-stream server is offline for whatever reason the logs are spooled locally and then resume when upstream is online."
"Rsyslog also supports encrypting the log stream so that secret leakage is limited to the sending nodes and the central nodes and it checks a few boxes."
"Another thing that helps is limiting to warn and above sent upstream and using an agent on the local nodes to monitor for keywords in the range of info to debug to let someone know to go check the node logs."
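The in-band "magic string" idea quoted above (wrapping secrets in a pair of markers and masking anything that contains them) might look roughly like this. It is a sketch of one reading of that comment, not the commenter's implementation: the marker format and the `seal`, `unseal`, and `maskSealed` names are assumptions.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical markers: a recognizable prefix plus a per-process random UUID.
const id = randomUUID();
const OPEN = `<<sealed:${id}>>`;
const CLOSE = `<</sealed:${id}>>`;

// Source side: wrap the secret as soon as it enters the program.
function seal(secret) {
  return OPEN + secret + CLOSE;
}

// Sink side: strip the markers only where the raw value is really needed.
function unseal(sealed) {
  if (!sealed.startsWith(OPEN) || !sealed.endsWith(CLOSE)) {
    throw new Error("value is not sealed");
  }
  return sealed.slice(OPEN.length, sealed.length - CLOSE.length);
}

// Guard for persisted output: mask everything between a pair of markers.
function maskSealed(text) {
  let out = "";
  let pos = 0;
  while (true) {
    const start = text.indexOf(OPEN, pos);
    if (start === -1) return out + text.slice(pos);
    out += text.slice(pos, start) + "[REDACTED]";
    const end = text.indexOf(CLOSE, start + OPEN.length);
    if (end === -1) return out; // unbalanced marker: drop the rest to be safe
    pos = end + CLOSE.length;
  }
}

const token = seal("hunter2");
console.log(maskSealed(`auth failed for token=${token} user=bob`));
// → "auth failed for token=[REDACTED] user=bob"
```

The guard only works while the markers survive intact. If a sealed string is truncated, encoded, or otherwise transformed before it reaches the log sink, the marker may no longer match, so this would complement rather than replace the other measures discussed.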

Developer Responsibility and Awareness

The role of developers is also a recurring theme, with commenters emphasizing the need for education and careful practice.

"And while people will write the code that accidentally introduces sensitive data into logs, they’re also the ones that will report, respond, and fix them."
"You need to pass the secrets to the logger so it can be redacted, it's heavily dependent on the dev and easy to forget during review."
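One way to reduce the reliance on developers remembering to register each secret with the logger is to make the secret value itself refuse to print. The sketch below is a different technique from the one the commenter describes, loosely in the spirit of the GuardedString idea mentioned in the thread; the `Secret` class and its `reveal` method are hypothetical, and the example assumes Node.js.

```javascript
const util = require("node:util");

// Hypothetical wrapper: the raw value is only reachable through reveal().
class Secret {
  #value;

  constructor(value) {
    this.#value = value;
  }

  // Explicit, greppable access point for the raw value.
  reveal() {
    return this.#value;
  }

  // String concatenation and template literals.
  toString() {
    return "[REDACTED]";
  }

  // JSON.stringify.
  toJSON() {
    return "[REDACTED]";
  }

  // console.log and util.inspect in Node.js.
  [util.inspect.custom]() {
    return "[REDACTED]";
  }
}

const apiKey = new Secret("sk_live_abc123"); // made-up value
console.log(`using key ${apiKey}`);
console.log(JSON.stringify({ apiKey }));
console.log(util.inspect({ apiKey }));
```

This does not help once `reveal()` has been called and the raw string is passed on, for example into a request header that a library later includes in an error, so review of those call sites still matters.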
"This is an excellent example of how to approach & elucidate a problem domain."

The discussion also includes meta-commentary on the terminology used in such discussions.

"Why do I have to know how many letters are in observability? Is this some kind of in group signaling?"