Code Scanning support for external data #146
Comments
My thinking is that we should add a new config option to explicitly opt into CSV extraction, and then issue the extra CodeQL CLI commands where appropriate.
Ping @jhutchings1: we'll be starting work on this soon. Any concerns with this?
Thanks for the ping on Slack, @ginsbach! I am not aware of any code scanning customers who have expressed interest in this (@jhutchings will be able to confirm). I only know of CodeQL power users who are interested in using CSV data. I'm sure we will see use cases in this area in the future, but I don't think we need to do anything now, so no further work is required here for the time being.
Thanks for the ping, team. Sorry I missed this one earlier. I have yet to hear of any customer requesting this particular scenario, so I don't think we need to prioritize any work here for the moment.
We will not work on an implementation for now. When customers request it, this issue should be a good place to start picking it up again.
## Background
For users who want to incorporate external data into a database, odasa supported a workflow based on CSV files. During database creation, a specific subdirectory was scanned for CSV files, and their content was used to populate the `externalData` table that is present in the `dbScheme` for each language. This feature was rarely used by users on their own. Instead, it mostly gave us a flexible way to implement specific user requests in a "quick-and-dirty" fashion by rapidly extending databases with custom entries.

This feature has recently been ported to the codeql-cli. Incorporating `external_data.csv` into `test_database` to supplement the tables that are generated from compiling `test.c` requires the following steps:

```shell
codeql database init --language=cpp --source-root=. test_database
codeql database trace-command test_database gcc test.c
codeql database index-files --language=csv --include=external_data.csv test_database  # external data is added here
codeql database finalize test_database
```

Note that this approach does not use any new command-line arguments along the lines of `--external-data=external_data.csv`. Instead, csv is treated as a full language and comes with an extractor that adheres to the common interface. Note furthermore that `test_database` is effectively a multi-language database without us really supporting multi-language databases: it only works because the csv `dbScheme` is a subset of the cpp `dbScheme`.

It would be desirable to enable the use of external data in CSV files for Code Scanning. Having it available in CodeQL should make the backend implementation straightforward, but it is unclear what the interface for this should look like.
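To make the workflow above concrete, here is a minimal sketch of what such a CSV file could contain. The column layout and values are purely illustrative assumptions, not a prescribed format; each row simply becomes one tuple available to queries via the `externalData` table.

```shell
# Create a hypothetical external_data.csv. The columns shown here
# (a file path plus two arbitrary metadata fields) are illustrative
# only; queries consuming the externalData table decide the layout.
cat > external_data.csv <<'EOF'
src/test.c,reviewed,2020-01-15
src/util.c,unreviewed,n/a
EOF

# Show the file that the index-files step would pick up.
cat external_data.csv
```

After the `codeql database index-files` step, each of these rows would be visible to queries as a tuple in the `externalData` table.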
## The Challenge
We need to introduce this in a way that already anticipates multi-language databases. The aim should be to introduce `{main language} + csv` databases in such a way that they do not become a special (legacy) case once multi-language databases ship. Therefore, we don't want any notion of "external data" or "csv files" specifically in the Code Scanning configuration. Instead, we want to treat this as creating a two-language database (where one language happens to be "csv" and the CSV files happen to contain "external data") that is configured via the same mechanism that will later be used for generic multi-language databases. This means that we have to anticipate quite a few future design decisions.
Some initial discussions with @aeisenberg made it clear that this might touch on pretty complex design decisions on the Code Scanning side and will require significant input from @robertbrignull and @jhutchings1.
## Potential Implementations

### Fully automatic
We could always look for CSV files in the entire repository when using `autobuild`. All CSV files would be incorporated into the database fully automatically, in addition to the detected principal language.

This would make sense if the aim were to abandon the rule that "autobuild [...] only ever attempts to build one compiled language for a repository" once multi-language databases become available. If eventually all programming-language files present in a repository will be included in the generated database, then the proposed mechanism is quite natural. Otherwise, it would seem like unjustified special treatment of CSV files.
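The fully automatic approach would essentially boil down to enumerating every CSV file in the checkout before indexing. A sketch, under the assumption that plain file discovery is sufficient (the `demo` directory and file names below are made up for illustration):

```shell
# Hypothetical repository layout, created only for this illustration.
mkdir -p demo/src demo/data
touch demo/src/main.c demo/data/external_data.csv demo/metrics.csv

# Step 1 of the sketch: discover every CSV file in the repository.
# Step 2 (not executed here) would hand the results to
#   codeql database index-files --language=csv ...
find demo -name '*.csv' | sort
```

Note that this indexes every CSV file, including ones (like `metrics.csv` above) that were never intended as external data, which is part of why automatic discovery may be too coarse.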
### Explicit two-language
We could enable the automatic indexing of CSV files in the repository only if precisely two languages, one of them csv, have been explicitly selected in the configuration, along the lines of
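The concrete configuration example alluded to above did not survive in this copy of the issue. As a purely hypothetical sketch (none of these keys are confirmed code scanning options), an explicit two-language selection might look something like:

```yaml
# Hypothetical code scanning configuration (illustrative only;
# these keys are assumptions, not documented options).
languages:
  - cpp
  - csv   # opting into CSV indexing as a second "language"
```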
Again, is it planned that multi-language databases will eventually be constructed this way? If so, this would seem like a sensible approach. Otherwise, not so much.
### Manual indexing
Instead of relying on `autobuild`, indexing of CSV files could be done with an explicit `index` mechanism. It would be possible to just add a configuration option that directly corresponds to `codeql database index-files`. The use of external data is quite advanced, so not having it available with `autobuild` would seem reasonable. However, we would not want to introduce this kind of feature just for CSV files. Therefore, this would probably only make sense if an explicit indexing command has been deemed useful in other contexts as well.

### Wait for multi-language databases
Finally, we could just wait and see what happens to multi-language databases. This would allow us to enable the use of external data with all the hindsight from implementing multiple languages more generally.