Goal
Create a pipeline to quickly generate high quality, diverse, realistic benchmarking and ML training data for secret scanners.
Rough Idea
- Build up a repository of real private leaks to baseline this process
- Set up a benchmark system with the private data with a collection of scanners using the SIG's patterns
- Experiment with ML-only scanning and feature extraction on the private data to baseline that process
- Set up a Synthetic Data Generation (SDG) pipeline to generate synthetic variants [1] of the private leaks
- Train the ML model on the synthetic data, and check whether the extracted features have a similar distribution to the private data
- Compare the quality of things built on the synthetic data against the private data (through things like cross-validation and hunting for new leaks).
- Publish the process and build up a large set of training data for folks to use
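To make the benchmark step concrete, here is a minimal sketch of scoring one scanner against a labeled corpus of leaks. The function name and the scoring scheme are my assumptions, not part of the proposal:

```python
def score_scanner(findings: set[str], labeled_leaks: set[str]) -> dict:
    """Compare a scanner's reported secrets against ground-truth leaks.

    Hypothetical helper: real benchmarking would also track file locations
    and secret types, not just the raw secret strings.
    """
    true_pos = len(findings & labeled_leaks)
    precision = true_pos / len(findings) if findings else 0.0
    recall = true_pos / len(labeled_leaks) if labeled_leaks else 0.0
    return {"precision": precision, "recall": recall}

# Toy example: the scanner reported 2 secrets (1 real, 1 false positive)
# and missed 1 of the 2 labeled leaks.
print(score_scanner({"ntn_aaa", "bogus"}, {"ntn_aaa", "ntn_bbb"}))
# → {'precision': 0.5, 'recall': 0.5}
```

The same harness would be run once against the private corpus and once against the synthetic corpus, so the two sets of scores can be compared directly.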
Initial SDG Process Idea
Secrets themselves often have hidden structure that isn't documented, so for each secret type, create custom code that takes as much of this into consideration as possible.
This may still miss some hidden structure, but for example, for a Notion API token:
```python
import random
import string

# Observed structure: "ntn_" prefix, 11 digits, then 35 alphanumerics.
prefix = "ntn_"
alphanum = string.ascii_letters + string.digits
token = "".join(
    (
        prefix,
        "".join(random.choices(string.digits, k=11)),
        "".join(random.choices(alphanum, k=35)),
    )
)
print(token)
```
To generate something like this:
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
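As a sanity check, each generated token can be validated against the same kind of pattern a scanner would use. The regex below is my reading of the structure in the snippet above, not an official Notion specification:

```python
import re

# Assumed Notion token shape: "ntn_", 11 digits, 35 alphanumerics.
NOTION_TOKEN_RE = re.compile(r"^ntn_\d{11}[A-Za-z0-9]{35}$")

token = "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"
print(bool(NOTION_TOKEN_RE.match(token)))  # → True
```

Running every synthetic secret through the scanner patterns before it enters the training set would catch generator bugs early.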
For the context generation, it's important to avoid license issues and to prevent anyone from using code comments, method names, etc. to search for the original leak. I believe this process should solve that (please correct me if you see any issues):
```python
synthetic_secret = gen_synthetic_secret(secret_type)
important_features = extract_important_features(private_leak)
description = llm(describe_prompt, private_leak)
synthetic_leak = llm(generate_prompt, description, important_features)
```
Where
- gen_synthetic_secret is the secret-type-specific generating code, like the Notion snippet above.
- extract_important_features is custom code to pull out things we want to note about the source leak, like language, maybe file size, and other parameters that would be passed to an LLM to shape the output, but generic enough to avoid copyright issues (e.g. "a Python file with over 500 lines of code" is too generic to be copyrighted).
- describe_prompt would be a prompt asking the LLM to do something similar to extract_important_features, but capturing some higher-level features. It is important to get this prompt right and make sure it doesn't include any of the original content in the description. It might be safer to drop this step and lean only on extract_important_features if that is enough, but being able to get some higher-level context would be nice. There would need to be a review process.
- generate_prompt would be the prompt to generate the synthetic data given the description and important features [2].
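The steps above could be sketched as follows. Everything here, from the function name to the specific features, is an illustrative assumption; the real feature set would need SIG review for genericness:

```python
def extract_important_features(private_leak: str, language: str) -> dict:
    """Pull out only generic, non-identifying facts about a leak file.

    Hypothetical feature set: the point is that every value is coarse
    enough (e.g. bucketed counts) that it can't locate the original file.
    """
    lines = private_leak.splitlines()
    return {
        "language": language,
        # Bucket the line count to the nearest 100 to stay generic.
        "approx_line_count": (len(lines) // 100) * 100,
        "has_comments": any(l.lstrip().startswith("#") for l in lines),
        "secret_in_string_literal": '"' in private_leak or "'" in private_leak,
    }

features = extract_important_features('API_KEY = "ntn_..."\n# config\n', "python")
print(features)
```

These key/value pairs would then be interpolated into generate_prompt, alongside the LLM-produced description, to shape the synthetic file.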
I would appreciate it if any SIG members with access to lawyers at their company could run this by them. I'll try to do the same [3].
Footnotes
1. The synthetic variants must NOT make it possible to find the original leak from the output, and generation should be done in a way that avoids license issues.
2. Example prompt that I've had good results with in my limited testing.
3. Any decision that final synthetic data is not within the scope of copyright is the responsibility of the SIG member that submitted it, not their parent company. Any opinion expressed by a company's lawyer is only that, an opinion, and is not the official position of that company. Our goal is to be good stewards of the data we are reviewing, and we plan to use only a description of it, not any of the original code. It is my understanding that simply saying a file is a Go file with 300 lines and 5 methods does not constitute a derived work, since the generated code will not contain any of the original nor be for a similar use.