✨ [Sema, Lex, Parse] Preprocessor embed in C and C++ (and Obj-C and Obj-C++ by-proxy) #68620

ThePhD · 2023-10-09T19:01:55Z

This pull request implements the entirety of the now-accepted N3017 - Preprocessor Embed and its sister C++ paper P1967. It implements everything in the specification, and includes an optimization that drastically reduces the time it takes to embed data in specific scenarios (the initialization of character type arrays). That optimization is applied under the "as-if" rule; whenever the compiler cannot detect that it is initializing an array object in a variable declaration, it simply expands the embedded data into a comma-separated sequence of numbers.

There are likely places where the __builtin_pp_embed intrinsic fails to compile properly and triggers ICEs. However, we have covered what we feel are the vast majority of cases where users would want or need the speedup.
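
To make the fallback behaviour concrete, here is a minimal sketch of what "expand the list out to a sequence of numbers" amounts to. It is not the code from this patch: the helper name `expandToIntegerList` and the exact spacing of the output are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the patch): renders raw file bytes as the
// comma-separated integer list that a directive such as
//
//     const unsigned char data[] = {
//     #embed "single_byte.txt"
//     };
//
// is treated "as if" it had expanded to when no fast path applies.
std::string expandToIntegerList(const std::vector<unsigned char> &Bytes) {
  std::string Out;
  for (std::size_t I = 0; I < Bytes.size(); ++I) {
    if (I != 0)
      Out += ", "; // separator between elements, none after the last one
    // Each byte becomes one decimal integer literal in the range 0..255.
    Out += std::to_string(static_cast<unsigned>(Bytes[I]));
  }
  return Out;
}
```

For a three-byte file containing `H`, `i` and a newline, the helper returns `72, 105, 10`. The speedup described above comes from skipping this textual expansion when the data initializes a character array, which is permitted because the observable result is the same.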

The images show a dry run of a recently built clang and LLVM in Release mode (not RelWithDebInfo) on Windows, on a machine with the following properties:

OS Name: Microsoft Windows 10 Pro
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: AMD Ryzen 9 5950X 16-Core Processor, 3401 Mhz, 16 Core(s), 32 Logical Processor(s)
Installed Physical Memory (RAM): 32.0 GB
Total Physical Memory: 31.9 GB
Total Virtual Memory: 36.7 GB

All of the added tests pass under the above machine properties.

I have no intention of following up on this PR. I am too tired to carry it to fruition. Pick this apart, take from it what you want, or reimplement it entirely. (I will be unsubscribing from this immediately after posting.)

Good luck ✌!

llvmbot · 2023-10-09T19:04:05Z

@llvm/pr-subscribers-clang-static-analyzer-1
@llvm/pr-subscribers-clang-modules
@llvm/pr-subscribers-clang-format
@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-llvm-support

Changes

Patch is 1.33 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/68620.diff

82 Files Affected:

(modified) clang/CMakeLists.txt (+2-1)
(modified) clang/include/clang/AST/Expr.h (+52)
(modified) clang/include/clang/AST/RecursiveASTVisitor.h (+1)
(modified) clang/include/clang/Basic/Builtins.def (+3)
(modified) clang/include/clang/Basic/DiagnosticCommonKinds.td (+6)
(modified) clang/include/clang/Basic/DiagnosticGroups.td (+7)
(modified) clang/include/clang/Basic/DiagnosticLexKinds.td (+23-1)
(modified) clang/include/clang/Basic/FileManager.h (+5-3)
(modified) clang/include/clang/Basic/StmtNodes.td (+1)
(modified) clang/include/clang/Basic/TokenKinds.def (+5)
(modified) clang/include/clang/Driver/Options.td (+16)
(modified) clang/include/clang/Frontend/PreprocessorOutputOptions.h (+2)
(modified) clang/include/clang/Lex/PPCallbacks.h (+118-82)
(added) clang/include/clang/Lex/PPDirectiveParameter.h (+32)
(added) clang/include/clang/Lex/PPEmbedParameters.h (+78)
(modified) clang/include/clang/Lex/Preprocessor.h (+191-128)
(modified) clang/include/clang/Lex/PreprocessorOptions.h (+7)
(modified) clang/include/clang/Lex/Token.h (+1-1)
(modified) clang/include/clang/Sema/Sema.h (+1124-1441)
(modified) clang/include/clang/Serialization/ASTBitCodes.h (+3)
(modified) clang/lib/AST/Expr.cpp (+17)
(modified) clang/lib/AST/ExprClassification.cpp (+4)
(modified) clang/lib/AST/ExprConstant.cpp (+8)
(modified) clang/lib/AST/ItaniumMangle.cpp (+1)
(modified) clang/lib/AST/StmtPrinter.cpp (+7)
(modified) clang/lib/AST/StmtProfile.cpp (+4)
(modified) clang/lib/Basic/FileManager.cpp (+41-35)
(modified) clang/lib/Basic/IdentifierTable.cpp (+2-1)
(modified) clang/lib/Driver/ToolChains/Clang.cpp (+4-1)
(modified) clang/lib/Format/FormatToken.h (+2)
(modified) clang/lib/Format/TokenAnnotator.cpp (+28)
(modified) clang/lib/Frontend/CompilerInvocation.cpp (+21)
(modified) clang/lib/Frontend/DependencyFile.cpp (+29)
(modified) clang/lib/Frontend/DependencyGraph.cpp (+51-15)
(modified) clang/lib/Frontend/InitPreprocessor.cpp (+7)
(modified) clang/lib/Frontend/PrintPreprocessedOutput.cpp (+50-32)
(modified) clang/lib/Frontend/Rewrite/InclusionRewriter.cpp (+98-85)
(modified) clang/lib/Interpreter/Interpreter.cpp (+1)
(modified) clang/lib/Lex/Lexer.cpp (+8)
(modified) clang/lib/Lex/PPCallbacks.cpp (-11)
(modified) clang/lib/Lex/PPDirectives.cpp (+942-174)
(modified) clang/lib/Lex/PPExpressions.cpp (+137-88)
(modified) clang/lib/Lex/PPMacroExpansion.cpp (+124)
(modified) clang/lib/Lex/Preprocessor.cpp (+3-2)
(modified) clang/lib/Parse/ParseDeclCXX.cpp (-1)
(modified) clang/lib/Parse/ParseExpr.cpp (+361-308)
(modified) clang/lib/Parse/ParseInit.cpp (+3-3)
(modified) clang/lib/Parse/ParseTemplate.cpp (+2)
(modified) clang/lib/Sema/SemaDecl.cpp (+1389-1383)
(modified) clang/lib/Sema/SemaDeclCXX.cpp (+2-1)
(modified) clang/lib/Sema/SemaExceptionSpec.cpp (+1)
(modified) clang/lib/Sema/SemaExpr.cpp (+1799-1483)
(modified) clang/lib/Sema/SemaTemplate.cpp (+947-1002)
(modified) clang/lib/Sema/SemaTemplateVariadic.cpp (+2-2)
(modified) clang/lib/Sema/TreeTransform.h (+7)
(modified) clang/lib/Serialization/ASTReaderStmt.cpp (+13)
(modified) clang/lib/Serialization/ASTWriterStmt.cpp (+10)
(modified) clang/lib/StaticAnalyzer/Core/ExprEngine.cpp (+4)
(added) clang/test/Preprocessor/Inputs/jk.txt (+1)
(added) clang/test/Preprocessor/Inputs/media/art.txt (+9)
(added) clang/test/Preprocessor/Inputs/media/empty ()
(added) clang/test/Preprocessor/Inputs/single_byte.txt (+1)
(added) clang/test/Preprocessor/embed___has_embed.c (+34)
(added) clang/test/Preprocessor/embed___has_embed_supported.c (+24)
(added) clang/test/Preprocessor/embed_art.c (+106)
(added) clang/test/Preprocessor/embed_feature_test.cpp (+13)
(added) clang/test/Preprocessor/embed_file_not_found.c (+4)
(added) clang/test/Preprocessor/embed_init.c (+28)
(added) clang/test/Preprocessor/embed_parameter_if_empty.c (+16)
(added) clang/test/Preprocessor/embed_parameter_limit.c (+15)
(added) clang/test/Preprocessor/embed_parameter_offset.c (+15)
(added) clang/test/Preprocessor/embed_parameter_prefix.c (+15)
(added) clang/test/Preprocessor/embed_parameter_suffix.c (+15)
(added) clang/test/Preprocessor/embed_parameter_unrecognized.c (+8)
(added) clang/test/Preprocessor/embed_path_chevron.c (+8)
(added) clang/test/Preprocessor/embed_path_quote.c (+8)
(added) clang/test/Preprocessor/embed_single_entity.c (+7)
(added) clang/test/Preprocessor/embed_weird.cpp (+68)
(added) clang/test/Preprocessor/single_byte.txt (+1)
(modified) llvm/CMakeLists.txt (+8)
(modified) llvm/cmake/modules/GetHostTriple.cmake (+3-3)
(modified) llvm/include/llvm/Support/Base64.h (+21-15)

diff --git a/clang/CMakeLists.txt b/clang/CMakeLists.txt
index 9b52c58be41e7f7..1b88905da3b8597 100644
--- a/clang/CMakeLists.txt
+++ b/clang/CMakeLists.txt
@@ -300,6 +300,7 @@ configure_file(
   ${CMAKE_CURRENT_BINARY_DIR}/include/clang/Basic/Version.inc)
 
 # Add appropriate flags for GCC
+option(CLANG_ENABLE_PEDANTIC "Compile with pedantic enabled." ON)
 if (LLVM_COMPILER_IS_GCC_COMPATIBLE)
   set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-common -Woverloaded-virtual")
   if (NOT "${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
@@ -307,7 +308,7 @@ if (LLVM_COMPILER_IS_GCC_COMPATIBLE)
   endif ()
 
   # Enable -pedantic for Clang even if it's not enabled for LLVM.
-  if (NOT LLVM_ENABLE_PEDANTIC)
+  if (NOT LLVM_ENABLE_PEDANTIC AND CLANG_ENABLE_PEDANTIC)
     set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pedantic -Wno-long-long")
   endif ()
 
diff --git a/clang/include/clang/AST/Expr.h b/clang/include/clang/AST/Expr.h
index b69c616b0090365..9303307fd6a8af5 100644
--- a/clang/include/clang/AST/Expr.h
+++ b/clang/include/clang/AST/Expr.h
@@ -4805,6 +4805,58 @@ class SourceLocExpr final : public Expr {
   friend class ASTStmtReader;
 };
 
+/// Represents a function call to __builtin_pp_embed().
+class PPEmbedExpr final : public Expr {
+  SourceLocation BuiltinLoc, RParenLoc;
+  DeclContext *ParentContext;
+  StringLiteral *Filename;
+  StringLiteral *BinaryData;
+
+public:
+  enum Action {
+    NotFound,
+    FoundOne,
+    Expanded,
+  };
+ 
+  PPEmbedExpr(const ASTContext &Ctx, QualType ResultTy, StringLiteral* Filename, StringLiteral* BinaryData,
+                SourceLocation BLoc, SourceLocation RParenLoc,
+                DeclContext *Context);
+
+  /// Build an empty call expression.
+  explicit PPEmbedExpr(EmptyShell Empty)
+      : Expr(PPEmbedExprClass, Empty) {}
+
+  /// Return the parent declaration context recorded for this
+  /// __builtin_pp_embed expression.
+  const DeclContext *getParentContext() const { return ParentContext; }
+  DeclContext *getParentContext() { return ParentContext; }
+
+  SourceLocation getLocation() const { return BuiltinLoc; }
+  SourceLocation getBeginLoc() const { return BuiltinLoc; }
+  SourceLocation getEndLoc() const { return RParenLoc; }
+
+  StringLiteral *getFilenameStringLiteral() const { return Filename; }
+  StringLiteral *getDataStringLiteral() const { return BinaryData; }
+
+  size_t getDataElementCount(ASTContext &Context) const;
+
+  child_range children() {
+    return child_range(child_iterator(), child_iterator());
+  }
+
+  const_child_range children() const {
+    return const_child_range(child_iterator(), child_iterator());
+  }
+
+  static bool classof(const Stmt *T) {
+    return T->getStmtClass() == PPEmbedExprClass;
+  }
+
+private:
+  friend class ASTStmtReader;
+};
+
 /// Describes an C or C++ initializer list.
 ///
 /// InitListExpr describes an initializer list, which can be used to
diff --git a/clang/include/clang/AST/RecursiveASTVisitor.h b/clang/include/clang/AST/RecursiveASTVisitor.h
index 3dd23eb38eeabfc..6b7211bb0a0d3f1 100644
--- a/clang/include/clang/AST/RecursiveASTVisitor.h
+++ b/clang/include/clang/AST/RecursiveASTVisitor.h
@@ -2809,6 +2809,7 @@ DEF_TRAVERSE_STMT(ShuffleVectorExpr, {})
 DEF_TRAVERSE_STMT(ConvertVectorExpr, {})
 DEF_TRAVERSE_STMT(StmtExpr, {})
 DEF_TRAVERSE_STMT(SourceLocExpr, {})
+DEF_TRAVERSE_STMT(PPEmbedExpr, {})
 
 DEF_TRAVERSE_STMT(UnresolvedLookupExpr, {
   TRY_TO(TraverseNestedNameSpecifierLoc(S->getQualifierLoc()));
diff --git a/clang/include/clang/Basic/Builtins.def b/clang/include/clang/Basic/Builtins.def
index 6ea8484606cfd5d..0dfc6456daf059a 100644
--- a/clang/include/clang/Basic/Builtins.def
+++ b/clang/include/clang/Basic/Builtins.def
@@ -1766,6 +1766,9 @@ BUILTIN(__builtin_ms_va_copy, "vc*&c*&", "n")
 // Arithmetic Fence: to prevent FP reordering and reassociation optimizations
 LANGBUILTIN(__arithmetic_fence, "v.", "tE", ALL_LANGUAGES)
 
+// preprocessor embed builtin
+LANGBUILTIN(__builtin_pp_embed, "v.", "tE", ALL_LANGUAGES)
+
 #undef BUILTIN
 #undef LIBBUILTIN
 #undef LANGBUILTIN
diff --git a/clang/include/clang/Basic/DiagnosticCommonKinds.td b/clang/include/clang/Basic/DiagnosticCommonKinds.td
index f2df283c74829f6..4df86e35eebde38 100644
--- a/clang/include/clang/Basic/DiagnosticCommonKinds.td
+++ b/clang/include/clang/Basic/DiagnosticCommonKinds.td
@@ -59,6 +59,9 @@ def err_expected_string_literal : Error<"expected string literal "
           "'external_source_symbol' attribute|"
           "as argument of '%1' attribute}0">;
 
+def err_builtin_pp_embed_invalid_argument : Error<
+  "invalid argument to '__builtin_pp_embed': %0">;
+
 def err_invalid_string_udl : Error<
   "string literal with user-defined suffix cannot be used here">;
 def err_invalid_character_udl : Error<
@@ -80,6 +83,9 @@ def err_expected : Error<"expected %0">;
 def err_expected_either : Error<"expected %0 or %1">;
 def err_expected_after : Error<"expected %1 after %0">;
 
+def err_builtin_pp_embed_invalid_location : Error<
+  "'__builtin_pp_embed' in invalid location: %0%select{|%2}1">;
+
 def err_param_redefinition : Error<"redefinition of parameter %0">;
 def warn_method_param_redefinition : Warning<"redefinition of method parameter %0">;
 def warn_method_param_declaration : Warning<"redeclaration of method parameter %0">,
diff --git a/clang/include/clang/Basic/DiagnosticGroups.td b/clang/include/clang/Basic/DiagnosticGroups.td
index 0b09c002191848a..7ebea56891d4654 100644
--- a/clang/include/clang/Basic/DiagnosticGroups.td
+++ b/clang/include/clang/Basic/DiagnosticGroups.td
@@ -708,6 +708,12 @@ def ReservedIdAsMacro : DiagGroup<"reserved-macro-identifier">;
 def ReservedIdAsMacroAlias : DiagGroup<"reserved-id-macro", [ReservedIdAsMacro]>;
 def RestrictExpansionMacro : DiagGroup<"restrict-expansion">;
 def FinalMacro : DiagGroup<"final-macro">;
+// Warnings about unknown preprocessor parameters (e.g. `#embed` and extensions)
+def UnsupportedDirective : DiagGroup<"unsupported-directive">;
+def UnknownDirectiveParameters : DiagGroup<"unknown-directive-parameters">;
+def IgnoredDirectiveParameters : DiagGroup<"ignored-directive-parameters">;
+def DirectiveParameters : DiagGroup<"directive-parameters",
+    [UnknownDirectiveParameters, IgnoredDirectiveParameters]>;
 
 // Just silence warnings about -Wstrict-aliasing for now.
 def : DiagGroup<"strict-aliasing=0">;
@@ -715,6 +721,7 @@ def : DiagGroup<"strict-aliasing=1">;
 def : DiagGroup<"strict-aliasing=2">;
 def : DiagGroup<"strict-aliasing">;
 
+
 // Just silence warnings about -Wstrict-overflow for now.
 def : DiagGroup<"strict-overflow=0">;
 def : DiagGroup<"strict-overflow=1">;
diff --git a/clang/include/clang/Basic/DiagnosticLexKinds.td b/clang/include/clang/Basic/DiagnosticLexKinds.td
index 940cca67368492f..4490f40806b0345 100644
--- a/clang/include/clang/Basic/DiagnosticLexKinds.td
+++ b/clang/include/clang/Basic/DiagnosticLexKinds.td
@@ -422,6 +422,22 @@ def warn_cxx23_compat_warning_directive : Warning<
 def warn_c23_compat_warning_directive : Warning<
   "#warning is incompatible with C standards before C23">,
   InGroup<CPre23Compat>, DefaultIgnore;
+def warn_c23_pp_embed : Warning<
+  "#embed is a C23 extension">,
+  InGroup<CPre23Compat>,
+  DefaultIgnore;
+def warn_c23_pp_has_embed : Warning<
+  "'__has_embed' is a C23 extension">,
+  InGroup<CPre23Compat>,
+  DefaultIgnore;
+def warn_cxx26_pp_embed : Warning<
+  "#embed is a C++26 extension">,
+  InGroup<CXXPre26Compat>,
+  DefaultIgnore;
+def warn_cxx26_pp_has_embed : Warning<
+  "'__has_embed' is a C++26 extension">,
+  InGroup<CXXPre26Compat>,
+  DefaultIgnore;
 
 def ext_pp_extra_tokens_at_eol : ExtWarn<
   "extra tokens at end of #%0 directive">, InGroup<ExtraTokens>;
@@ -483,7 +499,13 @@ def ext_pp_gnu_line_directive : Extension<
 def err_pp_invalid_directive : Error<
   "invalid preprocessing directive%select{|, did you mean '#%1'?}0">;
 def warn_pp_invalid_directive : Warning<
-  err_pp_invalid_directive.Summary>, InGroup<DiagGroup<"unknown-directives">>;
+  err_pp_invalid_directive.Summary>,
+  InGroup<UnsupportedDirective>;
+def warn_pp_unknown_parameter_ignored : Warning<
+  "unknown%select{ | embed}0 preprocessor parameter '%1' ignored">,
+  InGroup<UnknownDirectiveParameters>;
+def err_pp_unsupported_directive : Error<
+  "unsupported%select{ | embed}0 directive: %1">;
 def err_pp_directive_required : Error<
   "%0 must be used within a preprocessing directive">;
 def err_pp_file_not_found : Error<"'%0' file not found">, DefaultFatal;
diff --git a/clang/include/clang/Basic/FileManager.h b/clang/include/clang/Basic/FileManager.h
index 56cb093dd8c376f..c757f8775b425e9 100644
--- a/clang/include/clang/Basic/FileManager.h
+++ b/clang/include/clang/Basic/FileManager.h
@@ -276,11 +276,13 @@ class FileManager : public RefCountedBase<FileManager> {
   /// MemoryBuffer if successful, otherwise returning null.
   llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>
   getBufferForFile(FileEntryRef Entry, bool isVolatile = false,
-                   bool RequiresNullTerminator = true);
+                   bool RequiresNullTerminator = true,
+                   std::optional<int64_t> MaybeLimit = std::nullopt);
   llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>
   getBufferForFile(StringRef Filename, bool isVolatile = false,
-                   bool RequiresNullTerminator = true) {
-    return getBufferForFileImpl(Filename, /*FileSize=*/-1, isVolatile,
+                   bool RequiresNullTerminator = true,
+                   std::optional<int64_t> MaybeLimit = std::nullopt) {
+    return getBufferForFileImpl(Filename, /*FileSize=*/(MaybeLimit ? *MaybeLimit : -1), isVolatile,
                                 RequiresNullTerminator);
   }
 
diff --git a/clang/include/clang/Basic/StmtNodes.td b/clang/include/clang/Basic/StmtNodes.td
index cec301dfca2817b..e3be997dd1c86e0 100644
--- a/clang/include/clang/Basic/StmtNodes.td
+++ b/clang/include/clang/Basic/StmtNodes.td
@@ -203,6 +203,7 @@ def OpaqueValueExpr : StmtNode<Expr>;
 def TypoExpr : StmtNode<Expr>;
 def RecoveryExpr : StmtNode<Expr>;
 def BuiltinBitCastExpr : StmtNode<ExplicitCastExpr>;
+def PPEmbedExpr : StmtNode<Expr>;
 
 // Microsoft Extensions.
 def MSPropertyRefExpr : StmtNode<Expr>;
diff --git a/clang/include/clang/Basic/TokenKinds.def b/clang/include/clang/Basic/TokenKinds.def
index 94db56a9fd5d78c..167bd614efe7bd9 100644
--- a/clang/include/clang/Basic/TokenKinds.def
+++ b/clang/include/clang/Basic/TokenKinds.def
@@ -126,6 +126,9 @@ PPKEYWORD(error)
 // C99 6.10.6 - Pragma Directive.
 PPKEYWORD(pragma)
 
+// C23 & C++26 #embed
+PPKEYWORD(embed)
+
 // GNU Extensions.
 PPKEYWORD(import)
 PPKEYWORD(include_next)
@@ -751,6 +754,7 @@ ALIAS("__char32_t"   , char32_t          , KEYCXX)
 KEYWORD(__builtin_bit_cast               , KEYALL)
 KEYWORD(__builtin_available              , KEYALL)
 KEYWORD(__builtin_sycl_unique_stable_name, KEYSYCL)
+KEYWORD(__builtin_pp_embed               , KEYALL)
 
 // Keywords defined by Attr.td.
 #ifndef KEYWORD_ATTRIBUTE
@@ -986,6 +990,7 @@ ANNOTATION(repl_input_end)
 #undef CXX11_KEYWORD
 #undef KEYWORD
 #undef PUNCTUATOR
+#undef BUILTINOK
 #undef TOK
 #undef C99_KEYWORD
 #undef C23_KEYWORD
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index 5415b18d3f406df..bfc4b15d5411cde 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -114,6 +114,11 @@ def IncludePath_Group : OptionGroup<"<I/i group>">, Group<Preprocessor_Group>,
                         DocBrief<[{
 Flags controlling how ``#include``\s are resolved to files.}]>;
 
+def EmbedPath_Group : OptionGroup<"<Embed group>">, Group<Preprocessor_Group>,
+                        DocName<"Embed path management">,
+                        DocBrief<[{
+Flags controlling how ``#embed``\s and similar are resolved to files.}]>;
+
 def I_Group : OptionGroup<"<I group>">, Group<IncludePath_Group>, DocFlatten;
 def i_Group : OptionGroup<"<i group>">, Group<IncludePath_Group>, DocFlatten;
 def clang_i_Group : OptionGroup<"<clang i group>">, Group<i_Group>, DocFlatten;
@@ -816,6 +821,14 @@ will be ignored}]>;
 def L : JoinedOrSeparate<["-"], "L">, Flags<[RenderJoined]>, Group<Link_Group>,
     Visibility<[ClangOption, FlangOption]>,
     MetaVarName<"<dir>">, HelpText<"Add directory to library search path">;
+def embed_dir : JoinedOrSeparate<["-"], "embed-dir">,
+    Flags<[RenderJoined]>, Group<EmbedPath_Group>,
+    Visibility<[ClangOption, CC1Option, CC1AsOption, FlangOption, FC1Option]>,
+    MetaVarName<"<dir>">, HelpText<"Add directory to embed search path">;
+def embed_dir_EQ : JoinedOrSeparate<["-"], "embed-dir=">,
+    Flags<[RenderJoined]>, Group<EmbedPath_Group>,
+    Visibility<[ClangOption, CC1Option, CC1AsOption, FlangOption, FC1Option]>,
+    MetaVarName<"<dir>">, HelpText<"Add directory to embed search path">;
 def MD : Flag<["-"], "MD">, Group<M_Group>,
     HelpText<"Write a depfile containing user and system headers">;
 def MMD : Flag<["-"], "MMD">, Group<M_Group>,
@@ -1353,6 +1366,9 @@ def dD : Flag<["-"], "dD">, Group<d_Group>, Visibility<[ClangOption, CC1Option]>
 def dI : Flag<["-"], "dI">, Group<d_Group>, Visibility<[ClangOption, CC1Option]>,
   HelpText<"Print include directives in -E mode in addition to normal output">,
   MarshallingInfoFlag<PreprocessorOutputOpts<"ShowIncludeDirectives">>;
+def dE : Flag<["-"], "dE">, Group<d_Group>, Visibility<[ClangOption, CC1Option]>,
+  HelpText<"Print embed directives in -E mode in addition to normal output">,
+  MarshallingInfoFlag<PreprocessorOutputOpts<"ShowEmbedDirectives">>;
 def dM : Flag<["-"], "dM">, Group<d_Group>, Visibility<[ClangOption, CC1Option]>,
   HelpText<"Print macro definitions in -E mode instead of normal output">;
 def dead__strip : Flag<["-"], "dead_strip">;
diff --git a/clang/include/clang/Frontend/PreprocessorOutputOptions.h b/clang/include/clang/Frontend/PreprocessorOutputOptions.h
index db2ec9f2ae20698..3e36db3f8ce46ea 100644
--- a/clang/include/clang/Frontend/PreprocessorOutputOptions.h
+++ b/clang/include/clang/Frontend/PreprocessorOutputOptions.h
@@ -22,6 +22,7 @@ class PreprocessorOutputOptions {
   unsigned ShowMacroComments : 1;  ///< Show comments, even in macros.
   unsigned ShowMacros : 1;         ///< Print macro definitions.
   unsigned ShowIncludeDirectives : 1;  ///< Print includes, imports etc. within preprocessed output.
+  unsigned ShowEmbedDirectives : 1;  ///< Print embeds, etc. within preprocessed output.
   unsigned RewriteIncludes : 1;    ///< Preprocess include directives only.
   unsigned RewriteImports  : 1;    ///< Include contents of transitively-imported modules.
   unsigned MinimizeWhitespace : 1; ///< Ignore whitespace from input.
@@ -37,6 +38,7 @@ class PreprocessorOutputOptions {
     ShowMacroComments = 0;
     ShowMacros = 0;
     ShowIncludeDirectives = 0;
+    ShowEmbedDirectives = 0;
     RewriteIncludes = 0;
     RewriteImports = 0;
     MinimizeWhitespace = 0;
diff --git a/clang/include/clang/Lex/PPCallbacks.h b/clang/include/clang/Lex/PPCallbacks.h
index 94f96cf9c512541..7028c81dbc84aac 100644
--- a/clang/include/clang/Lex/PPCallbacks.h
+++ b/clang/include/clang/Lex/PPCallbacks.h
@@ -22,11 +22,11 @@
 #include "llvm/ADT/StringRef.h"
 
 namespace clang {
-  class Token;
-  class IdentifierInfo;
-  class MacroDefinition;
-  class MacroDirective;
-  class MacroArgs;
+class Token;
+class IdentifierInfo;
+class MacroDefinition;
+class MacroDirective;
+class MacroArgs;
 
 /// This interface provides a way to observe the actions of the
 /// preprocessor as it does its thing.
@@ -36,9 +36,7 @@ class PPCallbacks {
 public:
   virtual ~PPCallbacks();
 
-  enum FileChangeReason {
-    EnterFile, ExitFile, SystemHeaderPragma, RenameFile
-  };
+  enum FileChangeReason { EnterFile, ExitFile, SystemHeaderPragma, RenameFile };
 
   /// Callback invoked whenever a source file is entered or exited.
   ///
@@ -47,8 +45,7 @@ class PPCallbacks {
   /// the file before the new one entered for \p Reason EnterFile.
   virtual void FileChanged(SourceLocation Loc, FileChangeReason Reason,
                            SrcMgr::CharacteristicKind FileType,
-                           FileID PrevFID = FileID()) {
-  }
+                           FileID PrevFID = FileID()) {}
 
   enum class LexedFileChangeReason { EnterFile, ExitFile };
 
@@ -83,6 +80,47 @@ class PPCallbacks {
                            const Token &FilenameTok,
                            SrcMgr::CharacteristicKind FileType) {}
 
+  /// Callback invoked whenever the preprocessor cannot find a file for an
+  /// embed directive.
+  ///
+  /// \param FileName The name of the file being embedded, as written in the
+  /// source code.
+  ///
+  /// \returns true to indicate that the preprocessor should skip this file
+  /// and not issue any diagnostic.
+  virtual bool EmbedFileNotFound(StringRef FileName) { return false; }
+
+  /// Callback invoked whenever an embed directive has been processed,
+  /// regardless of whether the embed will actually find a file.
+  ///
+  /// \param HashLoc The location of the '#' that starts the embed directive.
+  ///
+  /// \param FileName The name of the file being embedded, as written in the
+  /// source code.
+  ///
+  /// \param IsAngled Whether the file name was enclosed in angle brackets;
+  /// otherwise, it was enclosed in quotes.
+  ///
+  /// \param FilenameRange The character range of the quotes or angle brackets
+  /// for the written file name.
+  ///
+  /// \param ParametersRange The character range of the embed parameters. An
+  /// empty range if there were no parameters.
+  ///
+  /// \param File The actual file that may be embedded by this embed directive.
+  ///
+  /// \param SearchPath Contains the search path which was used to find the file
+  /// in the file system. If the file was found via an absolute path,
+  /// SearchPath will be empty.
+  ///
+  /// \param RelativePath The path relative to SearchPath, at which the resource
+  /// file was found. This is equal to FileName.
+  virtual void EmbedDirective(SourceLocation HashLoc, StringRef FileName,
+                              bool IsAngled, CharSourceRange FilenameRange,
+                              CharSourceRange ParametersRange,
+                              OptionalFileEntryRef File, StringRef SearchPath,
+                              StringRef RelativePath) {}
+
   /// Callback invoked whenever the preprocessor cannot find a file for an
   /// inclusion directive.
   ///
@@ -151,7 +189,7 @@ class PPCallbacks {
   /// \param ForPragma If entering from pragma directive.
   ///
   virtual void EnteredSubmodule(Module *M, SourceLocation ImportLoc,
-                                bool ForPragma) { }
+                                bool ForPragma) {}
 
   /// Callback invoked whenever a submodule was left.
   ///
@@ -162,7 +200,7 @@ class PPCallbacks {
   /// \param ForPragma If entering from pragma directive.
   ///
   virtual void LeftSubmodule(Module *M, SourceLocation ImportLoc,
-                             bool ForPragma) { }
+                             bool ForPragma) {}
 
   /// Callback invoked whenever there was an explicit module-import
   /// syntax.
@@ -174,49 +212,40 @@ class PPCallbacks {
   ///
   /// \param Imported The imported module; can be null if importing failed.
   ///
-  virtual void moduleImport(SourceLocation ImportLoc,
-                            ModuleIdPath Path,
-                            const Module *Imported) {
-  }
+  virtual void moduleImport(SourceLocation ImportLoc, ModuleIdPath Path,
+                            const Module *Imported) {}
 
   /// Callback invoked when the end of the main file is reached.
   ///
   /// No subsequent callbacks will be made.
-  virtual void EndOfMainFile() {
-  }
+  virtual void EndOfMainFile() {}
 
   /// Callback invoked when a \#ident or \#sccs directive is read.
   /// \param Loc The location of the directive.
   /// \param str The text of the directive.
   ///
-  virtual void Ident(SourceLocation Loc, StringRef str) {
-  }
+  virtual void Ident(SourceLocation Loc, StringRef str) {}
 
   /// Callback invoked when start reading any pragma directive.
   virtual void PragmaDirective(SourceLocation Loc,
-                               PragmaIntroducerKind Introducer) {
-  }
+                               PragmaIntroducerKind Introducer) {}
 
   /// Callback invoked when a \#p...
[truncated]

github-actions · 2023-10-09T19:04:12Z

✅ With the latest revision this PR passed the C/C++ code formatter.

clang/lib/Format/TokenAnnotator.cpp

AaronBallman · 2023-10-09T19:47:19Z

FWIW, I spoke offline with the original author of the PR and he said that he's fine with me picking up the changes and carrying the review forward.

Because I don't know of any better way to commandeer a patch in GitHub, I'll probably grab the changes, get them into my own fork, and then recreate the PR to get the review started. Whenever we get ready to land the PR, @ThePhD will be credited as a co-author.

h-vetinari · 2023-10-09T22:59:30Z

> Because I don't know of any better way to commandeer a patch in GitHub

As a maintainer, you can push into this branch (unless @ThePhD unchecked the default setting when creating the PR), including forcefully.

For example (you could also use the GitHub CLI, but I'm using vanilla git to demo):

git remote add ThePhD https://github.com/ThePhD/llvm-project
git fetch ThePhD
git checkout thephd/embed-speed  # git will tell you you've got this branch, tracking the ThePhD remote
# do changes
git push ThePhD  # add `-f` if you've modified the existing commits, e.g. via rebase

jyknight · 2023-10-09T23:44:58Z

> This pull request implements the entirety of the now-accepted N3017 - Preprocessor Embed.

Amazing! I had started to think about looking into getting this implemented recently, so it's really nice to see an implementation uploaded now!

> I have no intention of following up on this PR. I am too tired to carry it to fruition. Pick this apart, take from it what you want, or reimplement it entirely.

I'm very sorry to hear that, but thank you for the contribution regardless!

shafik · 2023-10-10T21:13:34Z

> FWIW, I spoke offline with the original author of the PR and he said that he's fine with me picking up the changes and carrying the review forward.

> Because I don't know of any better way to commandeer a patch in GitHub, I'll probably grab the changes, get them into my own fork, and then recreate the PR to get the review started. Whenever we get ready to land the PR, @ThePhD will be credited as a co-author.

So just to clarify, we should wait till you post a follow-up review before commenting, to avoid fragmenting the discussion.

h-vetinari · 2023-10-11T03:52:12Z

> Because I don't know of any better way to commandeer a patch in GitHub

> As a maintainer, you can push into this branch

I had reached out to @ThePhD with an offer to help on this before they opened this PR, and I now have push access to their fork, which means I could change the state of this PR.

To be clear, I cannot offer to defend the content (semantically) in the face of review, but - if desired - I'd be happy to help with some more routine operations like rebasing, formatting, or chopping off pieces into separate PRs (all while keeping @ThePhD's authorship, of course).

AaronBallman · 2023-10-11T12:13:29Z

> Because I don't know of any better way to commandeer a patch in GitHub

> As a maintainer, you can push into this branch

> I had reached out to @ThePhD with an offer to help on this before they opened this PR, and I now have push access to their fork, which means I could change the state of this PR.

> To be clear, I cannot offer to defend the content (semantically) in the face of review, but - if desired - I'd be happy to help with some more routine operations like rebasing, formatting, or chopping off pieces into separate PRs (all while keeping @ThePhD's authorship, of course).

Oh, that's excellent! @ThePhD also gave me push access to his repo. If we're going to tag-team efforts on this and we both have access, it probably makes sense to just leave it in his repo and work from there. If he finds the notifications too distracting, we can always move the work elsewhere at that point.

@h-vetinari -- would you mind doing some initial cleanup on the patch for things like rebasing, removing spurious formatting changes, naming, and the likes? I have WG14 meetings coming up next week (they should hopefully go quickly though), so it is likely going to be ~a week before I have the chance to work on this more heavily.

So just to clarify, we should wait till you post a follow-up review before commenting, to avoid fragmenting the discussion.

Given the above, I'd say we can keep the review going here.

h-vetinari · 2023-10-12T06:41:40Z

@h-vetinari -- would you mind doing some initial cleanup on the patch for things like rebasing, removing spurious formatting changes, naming, and the likes?

I'll try to do that!

…+26. 🛠 [Frontend] Ensure commas inserted by #embed are properly serialized to output

h-vetinari · 2023-10-14T09:28:46Z

OK, I've got a first rebase going. However, because iterating on a diff with 1000s of lines is extremely hard to review and bound to lead to mistakes and iterations (making it harder still to follow what's going on), I've tried to follow a more structured approach.

Basically, I've pushed tags for intermediate milestones during the rebase to my fork, and am documenting below what I've done. That should make it easier to:

understand how the diff evolved
compare the various stages (e.g. one can append /compare/<tag>..<tag> to the repository URL on GitHub)
roll back to intermediate iterations if something goes wrong, without having to start from scratch
validate what I did (for those so inclined) or branch off in a different direction

I've used a simple tag scheme embed_vX[.Y], where X increases if existing commits were changed, and Y increases if changes happened without rebasing. I did some basic compilation tests to ensure I didn't accidentally drop/introduce changes that break the build, but no more than that.

The table is probably busted visually on mobile (sorry...), but I hope it reads alright on a desktop. Where used, equality (==) refers to the content of the tree at the commits being compared, not the diffs in those commits.

| tag | description | steps to reproduce | compile-tested¹ | diff to prev. version |
|---|---|---|---|---|
| `embed_v0` | the state of the PR as opened originally (base `08b20d8`) | check out the tag | `32c8097` ✅ (presumably); `357bda5` ❌ (needs `adc9737`); `dfd6383` ❌; `adc9737` ✅ | |
| `embed_v1` | separated superfluous formatting changes from first 2 commits (base `08b20d8`) | Interactive rebase galore (the most painful part) | `0b4b6a6` ✅; `a05117d` ✅ (==v0:`32c8097`); `8231b72` ❌ (needs `36c1a5f`); `0a6f48b` ❌ (==v0:`357bda5`); `b356860` ❌ (==v0:`dfd6383`); `36c1a5f` ✅ (==v0:`adc9737`) | empty |
| `embed_v1.1` | added suggestions from `git-clang-format` (base `08b20d8`) | C&P + `git apply` | v1 + `26c6fcc` ✅ | 1 commit; no rebase |
| `embed_v2` | undo superfluous formatting changes (base `08b20d8`) | `git rebase -i ...`; drop `a05117d` & `0a6f48b`, fix minor conflicts in `26c6fcc` | `0b4b6a6` ✅ (same as v1); `07795f2` ❌ (needs `748c3b5`); `1179116` ❌; `748c3b5` ✅ (likely; untested); `b6d053e` ✅ | uninteresting |
| `embed_v3` | squash small commits (base `08b20d8`) | `git rebase -i ...`; squash | `0b4b6a6` ✅ (same as v1); `a6f134d` ✅ | empty |
| `embed_v4` | rebase (cleanly) on main (base `7060422`) | `git rebase upstream/main` | `7050c93` ❔ (untested); `5956fc2` ❔ (untested) | uninteresting |
| `embed_v4.1` | formatting fix-ups (base `7060422`) | apply suggestions; inspect diff in UI | v4 + `01983b4` + `5f5d3a6` | 2 commits; no rebase |
| `embed_v5` | squash fix-ups (base `7060422`) | `git rebase -i ...`; squash | `7050c93` ❔ (same as v4); `6a7a4c9` ❔ (untested) | minor vs. v4; empty vs. v4.1 |

Since the rebase from v3 to v4 went cleanly, I haven't run the compilation tests. I will push v5 now, and hope that this is a good enough jumping-off point for further review and iteration.

¹ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_BUILD_TOOLS=OFF -DLLVM_CCACHE_BUILD=ON ..\llvm

⚡ [Lex] Better reservations for improved performance/memory usage. 🛠 [Lex, Frontend] Remove comma hardcoding since we are servicing a full file apply suggestions from git-clang-format

h-vetinari · 2023-10-20T03:13:10Z

JeanHeyd wrote up a blog post about this implementation, which is probably helpful as background material for any prospective reviewers. :)

clang/include/clang/Driver/Options.td

h-vetinari · 2023-10-25T11:55:54Z

@AaronBallman, hope the WG14 meeting went well? I've essentially done what I planned to do here, so I was wondering if you're ready to take it from here? I can try to incorporate the review feedback myself, but it's quite likely I'd make a mess of it (not familiar with the clang code base at all). 😅

Fznamznon · 2024-06-12T18:45:52Z

@vitalybuka @rorth thanks for the reports.
Unfortunately, it is way past working hours in my time zone, so I can only take a close look tomorrow.
If the problem is blocking and can't wait till morning in Europe, please feel free to revert.

…-C and Obj-C++ by-proxy)" (#95299) Reverts #68620 Introduce or expose a memory leak and UB, see #68620

jakubjelinek · 2024-06-14T08:29:11Z

Note, the version of clang with the above changes that I've tried still crashes on the embed-1.c test I'm writing for GCC test coverage, has some parsing issues that I believe are incorrect, and, if all the problematic stuff is commented out, fails at runtime, which IMHO it shouldn't, but I'm open to arguing about it.
Testcase in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105863#c14
(note, GPL licensed).
The test is in gcc/testsuite/c-c++-common/cpp/ directory and needs to have
embed-dir/embed-1.inc
embed-dir/magna-carta.txt
embed-1.c
embed-5.c
../empty.h
files around (the first 4 from the patch, the last one empty file).

$ ..../clang --embed-dir=./embed-dir -std=c23 embed-1.c
embed-1.c:29:103: error: expected identifier
   29 | #if __has_embed (__FILE__ __limit__ (1) __prefix__ () suffix (1 / 0) __if_empty__ ((({{[0[0{0{0(0(0)1)1}1}]]}})))) != __STDC_EMBED_FOUND__
      |                                                                                                       ^
embed-1.c:29:139: error: expected value in expression
   29 | #if __has_embed (__FILE__ __limit__ (1) __prefix__ () suffix (1 / 0) __if_empty__ ((({{[0[0{0{0(0(0)1)1}1}]]}})))) != __STDC_EMBED_FOUND__
      |                                                                                                                                           ^
clang-19: /usr/src/llvm-embed/llvm-project/clang/lib/Lex/Lexer.cpp:369: size_t getSpellingSlow(const clang::Token&, const char*, const clang::LangOptions&, char*): Assertion `Length < Tok.getLength() && "NeedsCleaning flag set on token that didn't need cleaning!"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /usr/src/llvm-embed/llvm-project/build/bin/clang-19 -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -dumpdir a- -disable-free -clear-ast-before-backend -main-file-name embed-1.c -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=all -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64 -tune-cpu generic -debugger-tuning=gdb -fdebug-compilation-dir=/usr/src/gcc/gcc/testsuite/c-c++-common/cpp -fcoverage-compilation-dir=/usr/src/gcc/gcc/testsuite/c-c++-common/cpp -resource-dir /usr/src/llvm-embed/llvm-project/build/lib/clang/19 --embed-dir=./embed-dir -internal-isystem /usr/src/llvm-embed/llvm-project/build/lib/clang/19/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-redhat-linux/12/../../../../x86_64-redhat-linux/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -std=c23 -ferror-limit 19 -fgnuc-version=4.2.1 -fskip-odr-check-in-gmf -fcolor-diagnostics -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/embed-1-e7e2cc.o -x c embed-1.c
1.	embed-1.c:71:2: current parser token 'if'
 #0 0x0000000004b70cf4 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x4b70cf4)
 #1 0x0000000004b7113c (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x4b7113c)
 #2 0x0000000004b6e8b2 llvm::sys::RunSignalHandlers() (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x4b6e8b2)
 #3 0x0000000004b706c9 (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x4b706c9)
 #4 0x00007fcd1a23dab0 __restore_rt (/lib64/libc.so.6+0x3eab0)
 #5 0x00007fcd1a28dc7c __pthread_kill_implementation (/lib64/libc.so.6+0x8ec7c)
 #6 0x00007fcd1a23da06 gsignal (/lib64/libc.so.6+0x3ea06)
 #7 0x00007fcd1a227834 abort (/lib64/libc.so.6+0x28834)
 #8 0x00007fcd1a22775b _nl_load_domain.cold (/lib64/libc.so.6+0x2875b)
 #9 0x00007fcd1a2365b6 (/lib64/libc.so.6+0x375b6)
#10 0x000000000ac1eee8 (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac1eee8)
#11 0x000000000ac1f479 clang::Lexer::getSpelling(clang::Token const&, char const*&, clang::SourceManager const&, clang::LangOptions const&, bool*) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac1f479)
#12 0x0000000005c0c189 (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x5c0c189)
#13 0x000000000acbd864 clang::Preprocessor::getSpelling(clang::Token const&, llvm::SmallVectorImpl<char>&, bool*) const (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xacbd864)
#14 0x000000000ac84a05 (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac84a05)
#15 0x000000000ac8791f clang::Preprocessor::EvaluateDirectiveExpression(clang::IdentifierInfo*&, clang::Token&, bool&, bool) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac8791f)
#16 0x000000000ac87d76 clang::Preprocessor::EvaluateDirectiveExpression(clang::IdentifierInfo*&, bool) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac87d76)
#17 0x000000000ac7a540 clang::Preprocessor::HandleIfDirective(clang::Token&, clang::Token const&, bool) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac7a540)
#18 0x000000000ac71da6 clang::Preprocessor::HandleDirective(clang::Token&) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac71da6)
#19 0x000000000ac2c80f clang::Lexer::LexTokenInternal(clang::Token&, bool) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac2c80f)
#20 0x000000000ac29cd7 clang::Lexer::Lex(clang::Token&) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xac29cd7)
#21 0x00000000088fd17f (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x88fd17f)
#22 0x000000000acbf2ee clang::Preprocessor::Lex(clang::Token&) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xacbf2ee)
#23 0x00000000088ff73e (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x88ff73e)
#24 0x00000000088f392e clang::Parser::Initialize() (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x88f392e)
#25 0x00000000088ef381 clang::ParseAST(clang::Sema&, bool, bool) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x88ef381)
#26 0x0000000005bbb588 clang::ASTFrontendAction::ExecuteAction() (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x5bbb588)
#27 0x00000000058c0393 clang::CodeGenAction::ExecuteAction() (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x58c0393)
#28 0x0000000005bbaed9 clang::FrontendAction::Execute() (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x5bbaed9)
#29 0x0000000005add728 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x5add728)
#30 0x0000000005d5a162 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0x5d5a162)
#31 0x0000000000dd2d31 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xdd2d31)
#32 0x0000000000dc4f9d (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xdc4f9d)
#33 0x0000000000dc5492 clang_main(int, char**, llvm::ToolContext const&) (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xdc5492)
#34 0x0000000000dfbcaf main (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xdfbcaf)
#35 0x00007fcd1a228590 __libc_start_call_main (/lib64/libc.so.6+0x29590)
#36 0x00007fcd1a228649 __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x29649)
#37 0x0000000000dc43e5 _start (/usr/src/llvm-embed/llvm-project/build/bin/clang-19+0xdc43e5)
clang: error: unable to execute command: Aborted (core dumped)
clang: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 19.0.0git (git@github.com:ThePhD/llvm-project.git a3615ed95b820546470a8407c2cfab2b89493dd1)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/src/llvm-embed/llvm-project/build/bin
Build config: +unoptimized, +assertions
clang: error: unable to execute command: Aborted (core dumped)
clang: note: diagnostic msg: Error generating preprocessed source(s).

jakubjelinek · 2024-06-14T08:42:09Z

Also wonder (for GCC the implementation is doing that too right now) if #embed on non-existent file needs to be a fatal error, for missing #include that is certainly the right thing, but perhaps #embed on non-existent/non-readable etc. file could be just treated as empty file after normal errors.

jakubjelinek · 2024-06-14T09:35:45Z

#if __has_embed("")
#endif
#if __has_embed(<>)
#endif
#embed ""
#embed <>

crashes as well in all 4 cases, but this is different from the earlier crash, incorrect parsing errors or runtime failures.

jakubjelinek · 2024-06-14T11:46:50Z

Also,

_Static_assert(sizeof(
#embed <single_byte.txt>
) ==
sizeof(unsigned char)
, ""
);

looks wrong to me. It also fails with clang -save-temps. I'd think sizeof ( 98 ) == sizeof (int).
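The size difference under discussion can be checked without `#embed` at all. The sketch below assumes `single_byte.txt` holds the single byte 98 (as in the comment above) and uses two hypothetical macros, `EMBED_AS_BARE_LITERAL` and `EMBED_AS_CAST_LITERAL`, to stand in for the two candidate expansions; neither name exists in Clang or GCC.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two candidate expansions of
 * `#embed <single_byte.txt>`, assuming the file holds the byte 98. */
#define EMBED_AS_BARE_LITERAL 98
#define EMBED_AS_CAST_LITERAL ((unsigned char)98)

/* Checks the operand sizes that sizeof would see for each expansion. */
static void check_embed_operand_sizes(void)
{
    /* A bare decimal literal that fits in int has type int. */
    assert(sizeof(EMBED_AS_BARE_LITERAL) == sizeof(int));

    /* sizeof does not promote its operand, so the cast form keeps
     * the size of unsigned char, which is 1 by definition. */
    assert(sizeof(EMBED_AS_CAST_LITERAL) == sizeof(unsigned char));
    assert(sizeof(EMBED_AS_CAST_LITERAL) == 1);
}
```

Calling `check_embed_operand_sizes()` from `main` passes on any conforming implementation, which is why the `_Static_assert` in the test can only hold if the expansion carries a cast or an equivalent typed expression.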

AaronBallman · 2024-06-14T12:10:16Z

Also,
_Static_assert(sizeof(
#embed <single_byte.txt>
) ==
sizeof(unsigned char)
, ""
);
looks wrong to me. It also fails with clang -save-temps. I'd think sizeof ( 98 ) == sizeof (int).

Huh, well that's interesting. I thought there was a standard requirement that the values be in a type matching that of the embed resource element width, so because our element width is CHAR_BIT, the values should be of 8-bit integer type, but the standard only talks about the values of the integer constant expressions, not the type of those expressions. I can definitely see why int would make sense, too.

The choice here matters for things like constexpr initialization in C23, __attribute__((overloadable)), C++ overload resolution or template instantiation, etc.

@ThePhD -- do you have any input here as the feature author?

jakubjelinek · 2024-06-14T12:27:03Z

The preprocessor has no notion of types and there are no integer literal suffixes for unsigned char constants, so if the intent was to make the constants have unsigned char, the preprocessor (unless doing magic, including -E preprocessing) would need to replace
#embed of Hello with
(unsigned char)72,(unsigned char)101,(unsigned char)108,(unsigned char)108,(unsigned char)111
rather than
72,101,108,108,111
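To make the two textual expansions concrete, here is a toy formatter. It is only an illustration of the output shapes quoted above, not how Clang or GCC produce them; `expand_bytes` and `demo_hello_expansions` are hypothetical names, and the expected strings assume an ASCII execution character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes the comma-separated expansion of `data` into `out`.
 * With `with_cast` nonzero, each value is prefixed by "(unsigned char)".
 * Returns the length written, or -1 if `out` is too small. */
static int expand_bytes(const unsigned char *data, size_t n, int with_cast,
                        char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < n; ++i) {
        int w = snprintf(out + used, cap - used, "%s%s%u",
                         i ? "," : "",
                         with_cast ? "(unsigned char)" : "",
                         (unsigned)data[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1; /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}

/* Reproduces the two expansions of "Hello" from the comment above. */
static void demo_hello_expansions(void)
{
    const unsigned char hello[] = { 'H', 'e', 'l', 'l', 'o' };
    char buf[128];
    int len;

    len = expand_bytes(hello, sizeof hello, 0, buf, sizeof buf);
    assert(len > 0);
    assert(strcmp(buf, "72,101,108,108,111") == 0);

    len = expand_bytes(hello, sizeof hello, 1, buf, sizeof buf);
    assert(len > 0);
    assert(strcmp(buf,
                  "(unsigned char)72,(unsigned char)101,(unsigned char)108,"
                  "(unsigned char)108,(unsigned char)111") == 0);
}
```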

AaronBallman · 2024-06-14T12:37:25Z

The preprocessor has no notion of types and there are no integer literal suffixes for unsigned char constants, so if the intent was to make the constants have unsigned char, the preprocessor (unless doing magic, including -E preprocessing) would need to replace #embed of Hello with (unsigned char)72,(unsigned char)101,(unsigned char)108,(unsigned char)108,(unsigned char)111 rather than 72,101,108,108,111

Yeah, at one point I was doing exactly that. :-D But I think we should probably go with straight integer literal values instead of casted integer constant expressions; we can add a custom embed parameter if we want to give users more control over the type of the elements. So I think the correct fix here is to change:

EmbedExpr::EmbedExpr(const ASTContext &Ctx, SourceLocation Loc,
                     EmbedDataStorage *Data, unsigned Begin,
                     unsigned NumOfElements)
    : Expr(EmbedExprClass, Ctx.UnsignedCharTy, VK_PRValue, OK_Ordinary),

to use Ctx.IntTy instead of Ctx.UnsignedCharTy.

jakubjelinek · 2024-06-14T12:53:52Z

https://godbolt.org/z/WhjhPrYns
https://godbolt.org/z/6nE46GMce
show that @ThePhD's original branch was handling them the way I expect.

AaronBallman · 2024-06-14T13:21:37Z

Thank you for all the excellent help @jakubjelinek, this has been really great feedback!

@Fznamznon -- I think the next steps for this should be to get the patch back to the point of passing CI without leaks, land the changes knowing there are still some issues to be worked out (as listed here and in #95222), and then start tackling those issues once the changes have landed.

ThePhD · 2024-06-14T16:22:15Z

Hi @AaronBallman!

As Jakub pointed out, the casting bit was something I did to give a consistent type to the literals. I dropped this wording for C in the specification but kept it in for C++ because they have more cases where they care deeply about what comes out (initializer lists, template argument deduction, constructors/conversions, etc.).

For the implementation I would recommend not casting for C as that is what the spec says right now. I can change that with a follow up paper I plan to do for #embed anyways for the C wording. This is not much of a material problem in C because, thankfully, there is nothing that can manifest this as a real problem, only test suite issues on precise semantics.

ThePhD · 2024-06-14T17:48:35Z

Also wonder (for GCC the implementation is doing that too right now) if #embed on non-existent file needs to be a fatal error, for missing #include that is certainly the right thing, but perhaps #embed on non-existent/non-readable etc. file could be just treated as empty file after normal errors.

Chiming in here, embed on a non-existent file should be an error. This is the intent. Because we provide a way to check if the file exists with __has_embed, we can already avoid a fatal error by just doing that check.
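A minimal sketch of that guard follows. The resource name `optional_blob.bin` and the macro `HAVE_OPTIONAL_BLOB` are hypothetical, and the one-byte placeholder is just one possible fallback. The outer `defined(__has_embed)` check is an assumption added so the snippet also compiles on toolchains without `#embed` support.

```c
#include <assert.h>
#include <stddef.h>

/* `optional_blob.bin` is a hypothetical resource name. */
#if defined(__has_embed)
#  if __has_embed("optional_blob.bin") == __STDC_EMBED_FOUND__
#    define HAVE_OPTIONAL_BLOB 1
#  endif
#endif
#ifndef HAVE_OPTIONAL_BLOB
#  define HAVE_OPTIONAL_BLOB 0
#endif

static const unsigned char optional_blob[] = {
#if HAVE_OPTIONAL_BLOB
#embed "optional_blob.bin"
#else
    0 /* placeholder used when the resource is missing or empty */
#endif
};

/* Size of the embedded data, or 1 when the placeholder is used. */
static size_t optional_blob_size(void)
{
    return sizeof optional_blob;
}

/* Either branch leaves the array with at least one element. */
static void check_optional_blob(void)
{
    assert(optional_blob_size() >= 1);
}
```

Because the check happens in `#if`, a missing file selects the fallback branch instead of reaching the `#embed` directive, so no missing-file diagnostic is produced at all.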

jakubjelinek · 2024-06-14T17:53:32Z

I'm not arguing about the error, just about whether it is a fatal error (in the sense that compilation stops immediately and no errors later on in the file are reported) vs. just a normal error (where it would diagnose it, then pretend the file is empty and so copy over if_empty tokens if any) and continue preprocessing and compilation, obviously with a non-zero error code from the compilation at the end. Yes, some extra errors might be reported because e.g. the source code syntactically wouldn't be able to deal with empty file, but perhaps in the common case it could help users find more issues in the same compilation.

JamesWidman · 2024-06-15T23:38:30Z

Yes, some extra errors might be reported because e.g. the source code syntactically wouldn't be able to deal with empty file, but perhaps in the common case it could help users find more issues in the same compilation.

This makes sense. My initial thought was that a missing embed-file would be like a missing header. But a missing header is treated as fatal because the lack of declarations provided by a header would lead to cascading errors (i.e., noise) against subsequent code that uses those declarations.

In contrast, it's not clear (to me) that the well-formedness of code following an ill-formed #embed would depend on e.g. specific array elements having specific values provided by the file, or on the containing array having a dimension greater than zero. If testing/surveys show that that kind of dependence is uncommon, then maybe a non-fatal error would be more appropriate (i.e., diagnose the missing file as an error, but keep processing the source file to try to find any additional errors).

Having said that...

Does the question of error category need to be resolved before #embed support is merged, or can the error category be changed after the merge?

ThePhD · 2024-06-16T00:25:38Z

I'm not arguing about the error, just about whether it is a fatal error (in the sense that compilation stops immediately and no errors later on in the file are reported) vs. just a normal error (where it would diagnose it, then pretend the file is empty and so copy over if_empty tokens if any) and continue preprocessing and compilation, obviously with a non-zero error code from the compilation at the end. Yes, some extra errors might be reported because e.g. the source code syntactically wouldn't be able to deal with empty file, but perhaps in the common case it could help users find more issues in the same compilation.

I think my opinion on this depends on what this actually means; forgive me for not understanding your distinction here.

The idea that this is a warning that will not stop compilation, so that my code, which handles an empty file but was not expecting one, just finishes building, is bad. If you're referring to architecture that differentiates between "error that stops now" and "error that stops eventually", then sure.

If this is "a casual warning versus an actual error", then please don't do this.

eli-schwartz · 2024-06-16T04:15:07Z

I think my opinion on this depends on what this actually means; forgive me for not understanding your distinction here.

Pretty sure this is just a question of whether the compiler should:

error out immediately ("fatal error")
collect every error in the file and then error out ("normal error") at the end by saying "lucky you, your code has TWO problems. A broken #embed, and also a -Werror=deprecated-declarations violation" (or what have you)

It makes zero difference at this point I think. It's an error either way since builds always fail, so the precise length of the resulting diagnostic output can be freely changed on a whim post-merge, as and when the compiler developers change their mind about how helpful said diagnostics are.
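The distinction can be modelled with a toy diagnostic loop. This is a hypothetical sketch, not Clang's or GCC's diagnostics engine; `run_build`, `struct build_result`, and the two severities are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severities: a normal error is recorded and processing
 * continues; a fatal error is recorded and processing stops. */
enum severity { SEVERITY_ERROR, SEVERITY_FATAL };

struct build_result {
    size_t diagnostics_shown; /* how many problems the user gets to see */
    int failed;               /* nonzero if the build fails */
};

/* Walks the problems in source order, stopping after the first fatal one. */
static struct build_result run_build(const enum severity *problems, size_t n)
{
    struct build_result r = { 0, 0 };

    for (size_t i = 0; i < n; ++i) {
        r.diagnostics_shown++;
        r.failed = 1; /* any error fails the build */
        if (problems[i] == SEVERITY_FATAL)
            break;
    }
    return r;
}

/* A broken #embed followed by one unrelated error, under both policies. */
static void demo_fatal_vs_normal(void)
{
    const enum severity embed_is_fatal[] = { SEVERITY_FATAL, SEVERITY_ERROR };
    const enum severity embed_is_normal[] = { SEVERITY_ERROR, SEVERITY_ERROR };

    struct build_result a = run_build(embed_is_fatal, 2);
    struct build_result b = run_build(embed_is_normal, 2);

    assert(a.failed && b.failed);     /* the build fails either way */
    assert(a.diagnostics_shown == 1); /* fatal: later error is hidden */
    assert(b.diagnostics_shown == 2); /* normal: both errors are shown */
}
```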

AaronBallman · 2024-06-17T14:01:13Z

My original thinking was that this should behave like a missing #include but you're right, the situations aren't quite the same in terms of error recovery. About the only (non-pathological) situation I can think of where continuing to attempt the compilation rather than stopping immediately would matter is code like:

constexpr const char buffer[256] = {
#embed "not found" limit(256)
};
constexpr int i = buffer[0] + buffer[1]; // Or whatever

where the follow-on error will be about invalid initialization of a constexpr variable if buffer is a local variable but will compile cleanly if buffer is a global variable (due to static zero init). But that feels defensible enough.

So I think I'd be fine making it a non-fatal error (but still an error; a warning would be problematic).

AaronBallman · 2024-06-18T18:29:33Z

Hi @AaronBallman!

As Jakub pointed out, the casting bit was something I did to give a consistent type to the literals. I dropped this wording for C in the specification but kept it in for C++ because they have more cases where they care deeply about what comes out (initializer lists, template argument deduction, constructors/conversions, etc.).

For the implementation I would recommend not casting for C as that is what the spec says right now. I can change that with a follow up paper I plan to do for #embed anyways for the C wording. This is not much of a material problem in C because, thankfully, there is nothing that can manifest this as a real problem, only test suite issues on precise semantics.

I don't think it's acceptable for C and C++ to have different behavior here:

auto foo[] = {
#embed "foo"
};

where auto is deduced to int[N] in C and unsigned char[N] in C++. It needs to deduce to the same type -- we do not want the preprocessor expansion to semantically differ depending on language mode.

Given that the conceptual model for #embed is that it produces a list of comma delimited values, and that limit is a way for the user to specify precisely how many such values are emitted, I think it makes more sense to use int as the element type as in C. Emitting something like 0L and making the element type long wouldn't be a burden, but because there's no suffix for unsigned character types, the preprocessor has to emit tokens for a cast operation. That means we're emitting five tokens per byte in the file ((, unsigned, char, ), integer literal), which is a lot of extra parsing work for significantly large files.

IMO, if users need fine-grained control over the element type, that should be handled via an explicitly written embed parameter; the preprocessor should remain as generally type agnostic as it is today.
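As a rough back-of-the-envelope sketch of the parsing cost, the helpers below count tokens for an n-byte file under each expansion. The function names are hypothetical, and including the separating commas in the count is my own assumption; the comment above only counts the five tokens per element.

```c
#include <assert.h>
#include <stddef.h>

/* Bare expansion: n integer literals plus n-1 separating commas. */
static size_t tokens_bare(size_t n)
{
    return n ? 2 * n - 1 : 0;
}

/* Cast expansion: "(", "unsigned", "char", ")", literal for each byte,
 * plus n-1 separating commas. */
static size_t tokens_cast(size_t n)
{
    return n ? 6 * n - 1 : 0;
}

/* Token counts for "Hello" (5 bytes) and for a 1 MiB file. */
static void demo_token_counts(void)
{
    const size_t one_mib = 1024u * 1024u;

    assert(tokens_bare(5) == 9);
    assert(tokens_cast(5) == 29);
    assert(tokens_bare(one_mib) == 2097151u);
    assert(tokens_cast(one_mib) == 6291455u);
}
```

Under these assumptions the cast form is about three times as many tokens for large files, which only matters when the expansion is actually materialized as tokens (for example with -E or -save-temps).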

jakubjelinek · 2024-06-18T18:45:55Z

I agree. If you want unsigned char constants, start with standardizing some integer literal suffix like UC for unsigned char (or whatever else), then add a new #embed parameter which allows e.g. supplying a suffix for all the constant literals, so one could just as well use U or UL or ULL or uwb etc. or UDLs etc. An integrated preprocessor could then choose whether the new parameter is something it can handle efficiently or, in the worst case, fall back to expanding it into an integer literal sequence with those suffixes.

ThePhD · 2024-06-18T18:53:47Z

I'll bring this up when this paper is discussed, probably at end-of-year or start-of-next-year, whichever meeting I can attend.

Though, FWIW, this was an explicit user request at-the-time, given std::initializer_list and constructor behaviors in C++. It was integrated into the paper quietly with no objections after discussion. I don't mind bringing this thread up and making change in the other direction and utilizing embed parameters to do it, but I should point out that the last time I brought something slightly similar (vendor::type(int32_t)) it did not go over well. Maybe something that works with UDLs in C++ would be better (though, where does that leave C?).

But sure; for C++ implement it as naked literals and use the force of implementation as a reason to help me justify the way you want it when this comes up for discussion in the C++ Committee, as I'm sure it inevitably will.

cor3ntin · 2024-06-18T18:59:59Z

Whether #embed produces tokens or an expression also affects weird edge cases like

void* f()  __attribute__((availability(macos,introduced=
#embed "digit.txt"
)));

and

struct S {
virtual void f() = 
#embed "zero.txt"
; 
};

And that needs to behave similarly in C and C++, such that if, say, the wording says that embed produces integer literal tokens (with no suffixes) (making the above nonsense well-formed), C++ needs to follow suit.

jakubjelinek · 2024-06-18T19:11:28Z

I'll bring this up when this paper is discussed, probably at end-of-year or start-of-next-year, whichever meeting I can attend.

Can't you attend it just remotely next week? Or does the paper author need to be physically present in CWG for the discussion of the paper?

jakubjelinek · 2024-06-18T19:14:51Z

OT, more complete analysis of the bugs on the clang side (at least IMHO) and implementation differences between this branch and GCC patch and the branch on godbolt are in the description of my gcc patch:
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655012.html
because
#68620 (comment)
wasn't too detailed.

The following patch implements the C23 N3017 "#embed - a scannable, tooling-friendly binary resource inclusion mechanism" paper.

The implementation is intentionally dumb, in that it doesn't significantly speed up compilation of larger initializers and doesn't make it possible to use huge #embeds (like several gigabytes large, that is compile time and memory still infeasible). There are 2 reasons for this. One is that I think like it is implemented now in the patch is how we should use it for the smaller #embed sizes, dunno with which boundary, whether 32 bytes or 64 or something like that, certainly handling the single byte cases which is something that can appear anywhere in the source where constant integer literal can appear is desirable and I think for a few bytes it isn't worth it to come up with something smarter and users would like to e.g. see it in -E readably as well (perhaps the slow vs. fast boundary should be determined by command line option). And the other one is to be able to more easily find regressions in behavior caused by the optimizations, so we have something to get back in git to compare against.

I'm definitely willing to work on the optimizations (likely introduce a new CPP_* token type to refer to a range of libcpp owned memory (start + size) and similarly some tree which can do the same, and can be at any time e.g. split into 2 subparts + say INTEGER_CST in between if needed say for const unsigned char d[] = { #embed "2GB.dat" prefix (0, 0, ) suffix (, [0x40000000] = 42) }; still without having to copy around huge amounts of data; STRING_CST owns the memory it points to and can be only 2GB in size), but would like to do that incrementally.

And would like to first include some extensions also not included in this patch, like gnu::offset (off) parameter to allow to skip certain constant amount of bytes at the start of the files, plus gnu::base64 ("base64_encoded_data") parameter to add something which can store more efficiently large amounts of the #embed data in preprocessed source.

I've been cross-checking all the tests also against the LLVM implementation llvm/llvm-project#68620 which has been for a few hours even committed to LLVM trunk but reverted afterwards. LLVM now has the support committed and I admit I haven't rechecked whether the behavior on the below mentioned spots have been fixed in it already or not yet.

The patch uses --embed-dir= option that clang plans to add above and doesn't use other variants on the search directories yet, plus there are no default directories at least for the time being where to search for embed files. So, #embed "..." works if it is found in the same directory (or relative to the current file's directory) and #embed "/..." or #embed </...> work always, but relative #embed <...> doesn't unless at least one --embed-dir= is specified. There is no reason to differentiate between system and non-system directories, so we don't need -isystem like counterpart, perhaps -iquote like counterpart could be useful in the future, dunno what else. It has --embed-directory=dir and --embed-directory dir as aliases.

There are some differences beyond clang ICEs, so I'd like to point them out to make sure there is agreement on the choices in the patch. They are also mentioned in the comments of the llvm pull request.

The most important is that the GCC patch (as well as the original thephd.dev LLVM branch on godbolt) expands #embed (or acts as if it is expanded) into a mere sequence of numbers like 123,2,35,26 rather than what clang effectively treats as (unsigned char)123,(unsigned char)2,(unsigned char)35,(unsigned char)26 but only does that when using integrated preprocessor, not when using -save-temps where it acts as GCC. JeanHeyd as the original author agrees that is how it is currently worded in C23.

Another difference (not tested in the testsuite, not sure how to check for effective target /dev/urandom nor am sure it is desirable to check that during testsuite) is how to treat character devices, named pipes etc. (block devices are errored on). The original paper uses /dev/urandom in various examples and seems to assume that unlike regular files the devices aren't really cached, so
#embed </dev/urandom> limit(1) prefix(int a = ) suffix(;)
#embed </dev/urandom> limit(1) prefix(int b = ) suffix(;)
usually results in a != b. That is what the godbolt thephd.dev branch implements too and what this patch does as well, but clang actually seems to just go from st.st_size == 0, ergo it must be zero-sized resource and so just copies over if_empty if present.

It is really questionable what to do about the character devices/named pipes with __has_embed, for regular files the patch doesn't read anything from them, relies on st.st_size + limit for whether it is empty or non-empty. But I don't know of a way to check if read on say a character device would read anything or not (the </dev/null> limit (1) vs. </dev/zero> limit (1) cases), and if we read something, that would be better cached for later because #embed later if it reads again could read no further data even when it first read something. So, the patch currently for __has_embed just always returns 2 on the non-regular files, like the thephd.dev branch does as well and like the clang pull request as well.
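The st_size pitfall described above can be sketched as follows. This is an illustration only and mirrors neither the GCC patch nor Clang; `classify_resource` and `enum resource_kind` are hypothetical names, and the code assumes a POSIX system. POSIX only gives st_size a defined meaning for a few file types such as regular files, so a size of 0 on a character device says nothing about how much data a read would return.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical classification of an embed resource from its stat data. */
enum resource_kind {
    RESOURCE_EMPTY,        /* regular file with st_size == 0 */
    RESOURCE_NONEMPTY,     /* regular file with st_size > 0 */
    RESOURCE_UNKNOWN_SIZE  /* device, pipe, etc.: st_size is not usable */
};

/* Trusts st_size only when the resource is a regular file. */
static enum resource_kind classify_resource(const struct stat *st)
{
    if (!S_ISREG(st->st_mode))
        return RESOURCE_UNKNOWN_SIZE;
    return st->st_size == 0 ? RESOURCE_EMPTY : RESOURCE_NONEMPTY;
}

/* Uses hand-built stat records so no real files are needed. */
static void demo_classify_resource(void)
{
    struct stat st;

    memset(&st, 0, sizeof st);
    st.st_mode = S_IFCHR | 0666; /* looks like a character device */
    st.st_size = 0;
    assert(classify_resource(&st) == RESOURCE_UNKNOWN_SIZE);

    st.st_mode = S_IFREG | 0644; /* regular file, zero bytes */
    assert(classify_resource(&st) == RESOURCE_EMPTY);

    st.st_size = 32;             /* regular file with data */
    assert(classify_resource(&st) == RESOURCE_NONEMPTY);
}
```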
A question is also what to do for gnu::offset on the non-regular files even for #embed; those aren't seekable, so do we want to just read and throw away the offset bytes each time we see it used?

clang also chokes on the

#if __has_embed (__FILE__ __limit__ (1) __prefix__ () suffix (1 / 0) \
__if_empty__ ((({{[0[0{0{0(0(0)1)1}1}]]}})))) != __STDC_EMBED_FOUND__
#error "__has_embed fail"
#endif

in embed-1.c, but the thephd.dev branch accepts it and I don't see why it shouldn't: (({{[0[0{0{0(0(0)1)1}1}]]}})) is a balanced token sequence and the file isn't empty, so it should just be parsed and discarded.

clang also IMHO mishandles

const unsigned char w[] = {
#embed __FILE__ prefix([0] = 42, [15] =) limit(32)
};

but again only without -save-temps. It seems to treat it as

[0] = 42, [15] = (99,111,110,115,116,32,117,110,115,105,103,110,101,100, 32,99,104,97,114,32,119,91,93,32,61,32,123,10,35,101,109,98)

rather than

[0] = 42, [15] = 99,111,110,115,116,32,117,110,115,105,103,110,101,100, 32,99,104,97,114,32,119,91,93,32,61,32,123,10,35,101,109,98

and warns on it for -Wunused-value and just compiles it as

[0] = 42, [15] = 98

And also

void foo (int, int, int, int);
void bar (void)
{
foo (
#embed __FILE__ limit (4) prefix (172 + ) suffix (+ 2)
);
}

is treated as 172 + (118, 111, 105, 100) + 2 rather than 172 + 118, 111, 105, 100 + 2, which is how clang -save-temps or GCC treats it, so it results in just one argument passed rather than 4.

if (!strstr ((const char *) magna_carta, "imprisonétur")) abort ();

in the testcase fails as well, but in that case calling it in gdb succeeds:

p ((char *(*)(char *, char *))__strstr_sse2) (magna_carta, "imprisonétur")
$2 = 0x555555558d3c <magna_carta+11564> "imprisonétur aut disseisiátur"...

so I guess they are just trying to constant evaluate strstr and do it incorrectly.
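The two readings of the expansion can be reproduced in plain C without #embed at all, by writing out by hand what each interpretation would produce. This is only an illustrative sketch. The function names are hypothetical, and the byte list is shortened to the first four values (99, 111, 110, 115) of the 32 quoted above.

```c
#include <stddef.h>

/* Bytes spliced in as plain tokens: the commas separate elements. */
size_t flat_count(void)
{
    int args[] = { 172 + 118, 111, 105, 100 + 2 };
    return sizeof args / sizeof args[0];
}

/* Bytes behaving like one parenthesized expression: the commas become
   comma operators, so only a single element remains. */
size_t grouped_count(void)
{
    int args[] = { 172 + (118, 111, 105, 100) + 2 };
    return sizeof args / sizeof args[0];
}

/* The comma operator yields its last operand: 172 + 100 + 2. */
int grouped_value(void)
{
    return 172 + (118, 111, 105, 100) + 2;
}

/* Flat list after a designator: later values continue at [16], [17], ... */
size_t flat_designated_size(void)
{
    unsigned char w[] = { [0] = 42, [15] = 99, 111, 110, 115 };
    return sizeof w;
}

/* Grouped list after a designator: [15] gets only the last value. */
size_t grouped_designated_size(void)
{
    unsigned char w[] = { [0] = 42, [15] = (99, 111, 110, 115) };
    return sizeof w;
}

unsigned char grouped_designated_last(void)
{
    unsigned char w[] = { [0] = 42, [15] = (99, 111, 110, 115) };
    return w[15];
}
```

Here flat_count() is 4 and grouped_count() is 1, matching the "one argument passed rather than 4" observation, and grouped_value() is 274. The flat designated array has 19 elements while the grouped one has 16 with w[15] == 115, the shortened analogue of the [15] = 98 result above. A compiler may warn about the unused comma operands, just as clang's -Wunused-value does in the report.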
They started with making the optimizations together in the initial patch set, so they don't have the luxury to compare whether it is just because of the optimization they are trying to do or because that is how the feature works for them. At least unless they use -save-temps for now.

There is also different behavior between clang and gcc on -M or other dependency generating options. Seems clang includes the __has_embed searched files in dependencies, while my patch doesn't. But clang does the same for __has_include and GCC doesn't. Emitting a hard dependency on some header just because there was __has_include/__has_embed for it seems wrong to me, because (at least when properly written) the source likely doesn't mind if the file is missing, it will do something else, so a hard error from make because of it doesn't seem right. Does make have some weaker dependencies, such that if some file can be remade it is, but if it doesn't exist, it isn't fatal?

I wonder whether #embed <non-existent-file> really needs to be fatal or whether we could simply, after diagnosing it, pretend the file exists and is empty. For #include I think fatal errors make tons of sense, but perhaps for #embed, which is more localized, we'd get better error reporting if we didn't bail out immediately. Note, both GCC and clang currently treat those as fatal errors.

clang also added a -dE option which, with -E, keeps the #embed directives as is instead of preprocessing them, but the preprocessed source then isn't self-contained. That option looks more harmful than useful to me.

Also, it isn't clear to me from C23 whether it is possible to have __has_include/__has_c_attribute/__has_embed expressions inside of the limit #embed/__has_embed argument. 6.10.3.2/2 says that defined should not appear there (and the patch diagnoses it and the testsuite tests it), but for __has_include/__has_embed etc.
6.10.1/11 says: "The identifiers __has_include, __has_embed, and __has_c_attribute shall not appear in any context not mentioned in this subclause." If that subclause in that case means 6.10.1, then it presumably shouldn't appear in #embed in 6.10.3, but __has_embed is in 6.10.1... But 6.10.3.2/3 says that it should be parsed according to the 6.10.1 rules. I haven't included tests like

#if __has_embed (__FILE__ limit (__has_embed (__FILE__ limit (1))))

or

#embed __FILE__ limit (__has_include (__FILE__))

into the testsuite because of the doubts, but I think the patch should handle those right now.

The reason I've used the Magna Carta text in some of the testcases is that I hope it shouldn't be copyrighted after the centuries, and I'd strongly prefer not to have binary blobs in git after the xz backdoor lesson and wanted something larger which doesn't change all the time.

Oh, BTW, I see in the C23 draft 6.10.3.2 in Example 4

if (f_source == NULL);
return 1;

(note the spurious semicolon after the closing paren), has that been fixed already?

Like the thephd.dev and clang implementations, the patch always macro expands the whole #embed and __has_embed directives except for the embed keyword. That is most likely not what C23 says. My limited understanding right now is that in #embed one needs to parse the whole directive line with macro expansion disabled and check if it satisfies the grammar; if not, the whole directive is macro expanded; if yes, only the limit parameter argument is macro expanded and the prefix/suffix/if_empty arguments are maybe macro expanded when actually used (and not at all if unused). And I think __has_embed macro expansion has conflicting rules.

2024-09-12  Jakub Jelinek  <jakub@redhat.com>

PR c/105863
libcpp/
* include/cpplib.h: Implement C23 N3017 #embed - a scannable, tooling-friendly binary resource inclusion mechanism paper.
(struct cpp_options): Add embed member.
(enum cpp_builtin_type): Add BT_HAS_EMBED.
(cpp_set_include_chains): Add another cpp_dir * argument to the declaration.
* internal.h (enum include_type): Add IT_EMBED.
(struct cpp_reader): Add embed_include member.
(struct cpp_embed_params_tokens): New type.
(struct cpp_embed_params): New type.
(_cpp_get_token_no_padding): Declare.
(enum _cpp_find_file_kind): Add _cpp_FFK_EMBED and _cpp_FFK_HAS_EMBED.
(_cpp_stack_embed): Declare.
(_cpp_parse_expr): Change return type to cpp_num_part instead of bool, change second argument from bool to const char * and add third argument.
(_cpp_parse_embed_params): Declare.
* directives.cc (DIRECTIVE_TABLE): Add embed entry.
(end_directive): Don't call skip_rest_of_line for T_EMBED directive.
(_cpp_handle_directive): Return 2 rather than 1 for T_EMBED in directives-only mode.
(parse_include): Don't call check_eol for T_EMBED directive.
(skip_balanced_token_seq): New function.
(EMBED_PARAMS): Define.
(enum embed_param_kind): New type.
(embed_params): New variable.
(_cpp_parse_embed_params): New function.
(do_embed): New function.
(do_if): Adjust _cpp_parse_expr caller.
(do_elif): Likewise.
* expr.cc (parse_defined): Diagnose defined in #embed or __has_embed parameters.
(_cpp_parse_expr): Change return type to cpp_num_part instead of bool, change second argument from bool to const char * and add third argument. Adjust function comment. For #embed/__has_embed parameters add an artificial CPP_OPEN_PAREN. Use the second argument DIR directly instead of string literals conditional on IS_IF. For #embed/__has_embed parameter, stop on reaching CPP_CLOSE_PAREN matching the artificial one. Diagnose negative or too large embed parameter operands.
(num_binary_op): Use #embed instead of #if for diagnostics if inside #embed/__has_embed parameter.
(num_div_op): Likewise.
* files.cc (struct _cpp_file): Add limit member and embed bitfield.
(search_cache): Add IS_EMBED argument, formatting fix. Skip over files with different file->embed from the argument.
(find_file_in_dir): Don't call pch_open_file if file->embed.
(_cpp_find_file): Handle _cpp_FFK_EMBED and _cpp_FFK_HAS_EMBED.
(read_file_guts): Formatting fix.
(has_unique_contents): Ignore file->embed files.
(search_path_head): Handle IT_EMBED type.
(_cpp_stack_embed): New function.
(_cpp_get_file_stat): Formatting fix.
(cpp_set_include_chains): Add embed argument, save it to pfile->embed_include and compute lens for the chain.
* init.cc (struct lang_flags): Add embed member.
(lang_defaults): Add embed initializers.
(cpp_set_lang): Initialize CPP_OPTION (pfile, embed).
(builtin_array): Add __has_embed entry.
(cpp_init_builtins): Predefine __STDC_EMBED_NOT_FOUND__, __STDC_EMBED_FOUND__ and __STDC_EMBED_EMPTY__.
* lex.cc (cpp_directive_only_process): Handle #embed.
* macro.cc (cpp_get_token_no_padding): Rename to ...
(_cpp_get_token_no_padding): ... this. No longer static.
(builtin_has_include_1): New function.
(builtin_has_include): Use it. Use _cpp_get_token_no_padding instead of cpp_get_token_no_padding.
(builtin_has_embed): New function.
(_cpp_builtin_macro_text): Handle BT_HAS_EMBED.
gcc/
* doc/cppdiropts.texi (--embed-dir=): Document.
* doc/cpp.texi (Binary Resource Inclusion): New chapter.
(__has_embed): Document.
* doc/invoke.texi (Directory Options): Mention --embed-dir=.
* gcc.cc (cpp_unique_options): Add %{-embed*}.
* genmatch.cc (main): Adjust cpp_set_include_chains caller.
* incpath.h (enum incpath_kind): Add INC_EMBED.
* incpath.cc (merge_include_chains): Handle INC_EMBED.
(register_include_chains): Adjust cpp_set_include_chains caller.
gcc/c-family/
* c.opt (-embed-dir=): New option.
(-embed-directory): New alias.
(-embed-directory=): New alias.
* c-opts.cc (c_common_handle_option): Handle OPT__embed_dir_.
gcc/testsuite/
* c-c++-common/cpp/embed-1.c: New test.
* c-c++-common/cpp/embed-2.c: New test.
* c-c++-common/cpp/embed-3.c: New test.
* c-c++-common/cpp/embed-4.c: New test.
* c-c++-common/cpp/embed-5.c: New test.
* c-c++-common/cpp/embed-6.c: New test.
* c-c++-common/cpp/embed-7.c: New test.
* c-c++-common/cpp/embed-8.c: New test.
* c-c++-common/cpp/embed-9.c: New test.
* c-c++-common/cpp/embed-10.c: New test.
* c-c++-common/cpp/embed-11.c: New test.
* c-c++-common/cpp/embed-12.c: New test.
* c-c++-common/cpp/embed-13.c: New test.
* c-c++-common/cpp/embed-14.c: New test.
* c-c++-common/cpp/embed-25.c: New test.
* c-c++-common/cpp/embed-26.c: New test.
* c-c++-common/cpp/embed-dir/embed-1.inc: New test.
* c-c++-common/cpp/embed-dir/embed-3.c: New test.
* c-c++-common/cpp/embed-dir/embed-4.c: New test.
* c-c++-common/cpp/embed-dir/magna-carta.txt: New test.
* gcc.dg/cpp/embed-1.c: New test.
* gcc.dg/cpp/embed-2.c: New test.
* gcc.dg/cpp/embed-3.c: New test.
* gcc.dg/cpp/embed-4.c: New test.
* g++.dg/cpp/embed-1.C: New test.
* g++.dg/cpp/embed-2.C: New test.
* g++.dg/cpp/embed-3.C: New test.
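On the dependency-generation point in the message, "properly written" source that probes with __has_embed and does something else when the file is missing might look like the following sketch. The file name optional-logo-hypothetical.bin, the HAVE_LOGO macro and the logo helpers are all hypothetical, and the outer #ifdef __has_embed guard is an added assumption so that the fragment still compiles on compilers without #embed support.

```c
#include <stddef.h>

/* Probe for the resource only if the preprocessor knows __has_embed. */
#ifdef __has_embed
#  if __has_embed("optional-logo-hypothetical.bin") == __STDC_EMBED_FOUND__
#    define HAVE_LOGO 1
#  endif
#endif
#ifndef HAVE_LOGO
#  define HAVE_LOGO 0
#endif

#if HAVE_LOGO
/* Resource found and non-empty: embed its bytes. */
static const unsigned char logo[] = {
#embed "optional-logo-hypothetical.bin"
};
#else
/* Resource missing, empty, or #embed unsupported: one placeholder byte. */
static const unsigned char logo[] = { 0 };
#endif

/* Number of bytes available in logo. */
size_t logo_size(void)
{
    return sizeof logo;
}

/* 1 if the real resource was embedded, 0 if the placeholder is in use. */
int logo_is_embedded(void)
{
    return HAVE_LOGO;
}
```

Such a translation unit builds whether or not the file exists, which is the argument against recording the probed file as a hard make dependency: a missing file is an expected state here, not an error.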

cor3ntin added the c23 label Oct 9, 2023

HazardyKnusperkeks reviewed Oct 9, 2023

clang/lib/Format/TokenAnnotator.cpp (3 outdated, resolved review threads)

jefftrull mentioned this pull request Oct 10, 2023

Support #embed boostorg/wave#131

Open

✨ [Sema, Driver, Lex, Frontend] Implement naive #embed for C23 and C++26.

7050c93

🛠 [Frontend] Ensure commas inserted by #embed are properly serialized to output

h-vetinari force-pushed the thephd/embed-speed branch from adc9737 to 5956fc2 Compare October 14, 2023 09:29

✨ Speedy #embed implementation

6a7a4c9

⚡ [Lex] Better reservations for improved performance/memory usage. 🛠 [Lex, Frontend] Remove comma hardcoding since we are servicing a full file apply suggestions from git-clang-format

h-vetinari force-pushed the thephd/embed-speed branch from 5f5d3a6 to 6a7a4c9 Compare October 14, 2023 10:15

ThePhD changed the title ~~[Sema, Lex, Parse] Preprocessor embed in C and C++ (and Obj-C and Obj-C++ by-proxy)~~ ✨ [Sema, Lex, Parse] Preprocessor embed in C and C++ (and Obj-C and Obj-C++ by-proxy) Oct 19, 2023

ThePhD mentioned this pull request Oct 21, 2023

✨ Allow #embed file replacement and preserve eol-tokens compiler-explorer/compiler-explorer#5600

Merged

MaskRay reviewed Oct 21, 2023

clang/include/clang/Driver/Options.td (outdated, resolved)

vitalybuka mentioned this pull request Jun 12, 2024

Revert "✨ [Sema, Lex, Parse] Preprocessor embed in C and C++ (and Obj-C and Obj-C++ by-proxy)" #95299

Merged

vitalybuka added a commit that referenced this pull request Jun 12, 2024

Revert "✨ [Sema, Lex, Parse] Preprocessor embed in C and C++ (and Obj-C and Obj-C++ by-proxy)" (#95299)

682d461

Reverts #68620. Introduce or expose a memory leak and UB, see #68620

Fznamznon mentioned this pull request Jun 17, 2024

Reland [clang][Sema, Lex, Parse] Preprocessor embed in C and C++ #95802

Merged
