This repository has been archived by the owner on Dec 14, 2021. It is now read-only.

Add Typescript support #2

Open
bheklilr opened this issue Feb 6, 2019 · 33 comments
Labels
bindings request enhancement New feature or request

Comments

@bheklilr

bheklilr commented Feb 6, 2019

I would be interested in adding typescript support to SourcetrailDB. This would also lay the groundwork for JavaScript support as well.

Specifically, I would like to ensure that react code is supported as well.

@egraether
Contributor

Sounds interesting! The first thing you need to look into is how to collect all source information from TypeScript code. Is there already a library for generating and iterating an Abstract Syntax Tree? What information does it provide? Definitions of types, variables, functions and where they are referenced, including their exact locations in the source code? Once you have a prototype for that, we can think about storing everything with SourcetrailDB.

@bheklilr
Author

bheklilr commented Feb 6, 2019

Hello Eberhard!

I will have to do some research to answer all of your questions, but I know that Typescript already has a language service that adheres to the Language Server Protocol, which is what I would probably start with. Alternatively, the typescript compiler has an API for parsing and working with the AST. From a quick test, it looks like it understands the React code syntax natively as well, so that's awesome. The API for working with the AST can definitely extract source code positions, types, etc. It's pretty thorough.

The hard part will probably be the interop with C++ and Sourcetrail. It may be as simple as pointing C++ at an external tool, and parsing the output. That would probably be slow to run, but fast to implement. I can start looking into that over the next couple days, with this weekend being a good time for me to really dive into it.

@egraether
Contributor

Sounds good to me!

Note that we are using SWIG for the Python binding used in our Python Indexer. The SWIG page states that JavaScript is also supported for bindings, which might work for you. Anyways, let's worry about this later.

We had a quick look at the Language Server Protocol some time ago and were not sure if it supplies all necessary information. Its usage also seems limited to doing single requests, which is not ideal for Sourcetrail indexing, but can work if the indexer is written like a crawler. Anyways, we have not taken a deep look into LSP so far.

@bheklilr
Author

bheklilr commented Feb 6, 2019

I don't know too many details of the LSP, I just knew it was what VSCode and a couple of other editors use to easily add language support for things like renaming, jumping around the code, etc. If it doesn't provide enough information then I can understand why you would have gone with a different solution.

I can confirm experimentally that the typescript library that is usable by Javascript to compile Typescript code comes with a wealth of information, including types, source code positions, expressions, and more. It should be more than enough for Sourcetrail to build an index.

@mlangkabel
Contributor

@bheklilr, did you make any progress on this one? Otherwise I would start working on making the SourcetrailDB interface callable from Javascript.

@mlangkabel mlangkabel added the enhancement New feature or request label Feb 11, 2019
@bheklilr
Author

@mlangkabel I looked more into the Typescript side of things (just research) to see how to extract the index data from an entire project, and I'm still working on that. I didn't get as much time this weekend to work on it as I would have liked, but there still has been some progress. I'll try to get some of my notes committed tonight.

@bheklilr
Author

I've got the skeleton of a project set up following https://github.com/CoatiSoftware/SourcetrailPythonIndexer, it's currently over in my repos at https://github.com/bheklilr/SourcetrailTypescriptIndexer.

I've gotta get back to work right now, but I'll update it later tonight with more info on what I've found out. I also discovered that I won't be able to really do any development on my work laptop (even though this is justifiably work related) due to needing a C++ compiler to do the FFI between node and the DB library. It looks like it won't be too difficult to get working using https://github.com/node-ffi/node-ffi, I can just load a .dll or .so and will just have to build the interop layer between JS and SourcetrailDB.

I'm planning on just building this project in JS, even though it's for TS. It'll just simplify the build process, since typescript just compiles to JS. I don't expect this to be a particularly large project after looking at the python code.

@mlangkabel
Contributor

Thanks for the update!

For creating an interop layer for Python we are using SWIG. So the SWIG interface definition already exists. This morning I briefly looked into generating SWIG based Javascript bindings http://www.swig.org/Doc3.0/Javascript.html. It should be possible, but when generating that layer one has to specify a desired target engine (-jsc, -v8, or -node).

I guess that node-ffi only allows generating bindings that work with the node engine, but that would also be fine. One more thing to watch out for is the licensing of the external dependencies and tools that we use when creating those extensions. For node-ffi it is looking good :)

@bheklilr
Author

@mlangkabel I'll be sure to go through and check the licensing on the dependencies. Is there any particular restriction I need to keep in mind?

I'll have to evaluate which route will be best for developing the bindings. If I remember correctly SWIG is more of an auto-generated set of bindings that get compiled into an extension module (at least for Python), while node-ffi dynamically calls into a shared lib at runtime. There are probably more performance concerns there.

As for which JS engine to use, it'll certainly be node.

@mlangkabel
Contributor

For licensing: It is just that the used libraries and tools need to be compatible with the Apache license of this project.

I had a further look at the SWIG-based node bindings and it seems that they are not as easy to set up as expected. It looks like the SWIG support for more recent versions of node is broken.

@LouisStAmour

I've bindings now for TypeScript or Node.js at https://github.com/LouisStAmour/sourcetraildb -- not published to NPM, not much in the way of unit tests, but "it works!" if you've CMake installed. Tested only with VS Build tools installed on Windows, so no idea how portable the CMake instructions are. It's possible file paths need adjustment.

As to the code itself ... it could be cleaned up, there's a lot of copy and paste in the CPP code, and no guarantees that the TypeScript declaration actually matches the CPP implementation when ideally I'd code-generate everything using a custom script from an AST based on the original CPP SDK or similar. Or I'd use the TS to generate the CPP bindings somehow, instead of manually doing tests for values and setting errors.

Additionally, I'm thinking I'd want to change the wrapper or TS API from returning success booleans to instead returning void and throwing exceptions. It's more the pattern expected of TS code, I think, and it's annoying to have to use wrapper code to fail hard on an error when an Exception would do it for you. I might do it upstream, as it was a bit annoying to not be able to use CPP's excellent exception support and instead return std::pairs with success booleans for conversion functions.
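
A minimal sketch of that boolean-to-exception idea. The `RawWriter` interface, its method names and the fake implementation are hypothetical stand-ins for the native bindings, not the published API:

```typescript
// Hypothetical stand-in for the native bindings: every call reports success as a boolean.
interface RawWriter {
  open(path: string): boolean;
  close(): boolean;
  getLastError(): string;
}

class SourcetrailError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SourcetrailError";
  }
}

// Wraps a boolean-returning writer so that failures surface as exceptions.
class ThrowingWriter {
  private readonly raw: RawWriter;

  constructor(raw: RawWriter) {
    this.raw = raw;
  }

  open(path: string): void {
    this.check(this.raw.open(path), `open(${path})`);
  }

  close(): void {
    this.check(this.raw.close(), "close()");
  }

  private check(ok: boolean, what: string): void {
    if (!ok) {
      throw new SourcetrailError(`${what} failed: ${this.raw.getLastError()}`);
    }
  }
}

// Fake implementation, used only to exercise the wrapper.
const fakeRaw: RawWriter = {
  open: (path) => path.endsWith(".srctrldb"),
  close: () => true,
  getLastError: () => "unsupported file extension",
};

const writer = new ThrowingWriter(fakeRaw);
```

With this shape, calling code no longer has to test each return value; a failed `writer.open(...)` simply throws with the underlying error text.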

Finally, I'm not convinced the TS API works exactly the way I want it to. I tried to simplify creating NameElement objects with a constructor that can take just a name, or a set of 3 parameters to include prefix, name, postfix. But it still feels odd to create both NameHierarchy and NameElement objects; ideally there'd be some kind of overload so you could do simpler use cases more easily. You can see the ported CPP example in the form of cpp_example.ts in the repo. I'm a bit tired after spending all weekend researching the best ways to write native code for Node and ending up at CMake.js with node-addon-api over N-API as the best way, with a TypeScript declaration for ease-of-use.
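
One possible shape for such an overload, as a sketch only. The classes below are hypothetical and merely mirror the NameElement / NameHierarchy idea; the static `of` helper and `toString` are assumptions for illustration, not part of the bindings:

```typescript
// Hypothetical shapes mirroring the NameElement / NameHierarchy idea.
class NameElement {
  readonly prefix: string;
  readonly name: string;
  readonly postfix: string;

  constructor(name: string);
  constructor(prefix: string, name: string, postfix: string);
  constructor(a: string, b?: string, c?: string) {
    if (b === undefined) {
      // Single-argument form: just a name.
      this.prefix = "";
      this.name = a;
      this.postfix = "";
    } else {
      this.prefix = a;
      this.name = b;
      this.postfix = c ?? "";
    }
  }
}

class NameHierarchy {
  readonly delimiter: string;
  readonly elements: NameElement[];

  constructor(delimiter: string, elements: NameElement[]) {
    this.delimiter = delimiter;
    this.elements = elements;
  }

  // Convenience factory: accepts plain names or ready-made elements.
  static of(delimiter: string, ...parts: Array<string | NameElement>): NameHierarchy {
    const elements = parts.map((p) => (typeof p === "string" ? new NameElement(p) : p));
    return new NameHierarchy(delimiter, elements);
  }

  toString(): string {
    return this.elements.map((e) => e.name).join(this.delimiter);
  }
}

const sendSignal = NameHierarchy.of(".", "Client", new NameElement("void", "send_signal", "()"));
```

The simple case stays a one-liner, while prefix and postfix remain available where a signature needs them.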

If you're looking for native indexing support in Sourcetrail for TypeScript, you won't find it here yet. Maybe I'll have something proof-of-concept up by next weekend, but writing the bindings was supposed to be the easy part. Now I need to map either TypeScript or Babel types (probably Babel's because it's more compatible with raw source generally) to the various APIs and Kinds used by this API. Using a tree walker with Babel, I suspect, will actually be easier than using this API correctly with Sourcetrail -- it might take a few iterations of trial-and-error before things turn out exactly as they should...

For now, given the beta nature of these bindings, I don't feel comfortable publishing them to NPM yet. Just point your package.json to the git repo if you want to try it. Once I've got things working for TypeScript, I'd like to turn my attention to C# support via Roslyn's AST next, which might be as easy as compiling the C++ code using a managed C++ compiler, or it might require SWIG bindings. We'll see. I've worked at a couple companies recently on a JS/TS/C# heavy stack, and I miss Sourcetrail from when I worked on primarily Java codebases, so this is my way of trying to get that back!

JS or TS bindings will probably prove very flexible, language-wise, as any AST listed at https://astexplorer.net/ has JS bindings. It's a pretty long list thanks to the number of languages that compile to JavaScript or have semantic support for JS IDEs like VS Code including, in alphabetical order: CSS, GraphQL, Graphviz, Handlebars (templating), HTML, ICU MessageFormat, JavaScript (of course), JSON, Lua, Markdown, MDX, PHP, Regexp, Scala, Solidity, SQL, WebIDL, and YAML. Now, how much value will you get out of Markdown support or Regexp with this kind of tool? Probably not much. But it's possible.

Me, I'll be trying to map JSX or TSX to JS symbols, but outside of that I'm not sure how usefully Sourcetrail would integrate with a document-based syntax like HTML without importing an entire ontology of JS DOM node types from MDN or W3C specs and even then... the dynamic nature of most template languages (and JS itself) means it can be hard to know with 100% certainty what's going on. But it's a step in the right direction...

@mlangkabel
Contributor

mlangkabel commented Nov 4, 2019

@LouisStAmour: Wow, you are awesome! Thank you for making this! We've just checked out your repository and got the example to run without any issue. So this is already looking really nice!

Unfortunately we don't have a typescript expert on our team, so we cannot really give you any advice on the Typescript side of things, but if you say that exceptions are the Typescript way, go for it! It should be very straightforward to turn the bool that the Cxx code returned into an exception on the Typescript side.

When you start creating the actual Typescript indexer, please take a look at our SourcetrailPythonIndexer repository. The SourcetrailPythonIndexer can be run from within Sourcetrail via the CustomCommand SourceGroup. If you want, you can create a command line API for the Typescript indexer that is similar to the SourcetrailPythonIndexer, so you can already use your indexer on Typescript code without having to wait for a new version of Sourcetrail that ships your code ;)

For the SourcetrailPythonIndexer we started by creating a deep indexer, that traverses the AST and for every symbol that is referenced (e.g. the line Client::send_signal(); in the cpp_example) it would stop and try to find out the exact name hierarchy of the method that is called here (it gets more complex for non-static methods that are called on variables).
But deep indexing is a lot of work and it is slow at indexing time. So right now we are expanding the SourcetrailPythonIndexer by a --shallow mode where the indexer traverses the whole AST. For each definition it records a symbol but for references it only calls recordReferenceToUnsolvedSymhol() (your wrapper already provides that function). Sourcetrail already provides an option in the preferences (called Add ambiguous edges for unsolved references) that tries to resolve the target symbols of those edges by name. So you could start with the shallow mode, if you think that that's easier.
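
To make the shallow mode more concrete, here is a rough sketch of resolving unsolved references by name. The data model is hypothetical and only illustrates the idea; in particular, treating a single match as a non-ambiguous edge is an assumption of this sketch, not a statement about how Sourcetrail behaves:

```typescript
// Hypothetical in-memory model; names are illustrative, not SourcetrailDB's schema.
interface Definition {
  qualifiedName: string; // e.g. "X.baz"
}

interface UnsolvedReference {
  from: string;       // symbol containing the reference, e.g. "foo"
  targetName: string; // unqualified name seen at the call site, e.g. "baz"
}

interface Edge {
  from: string;
  to: string;
  ambiguous: boolean;
}

// Resolve unsolved references by name: several candidates yield one
// ambiguous edge per candidate; a single candidate yields a plain edge.
function resolveByName(defs: Definition[], refs: UnsolvedReference[]): Edge[] {
  const edges: Edge[] = [];
  for (const ref of refs) {
    const candidates = defs.filter((d) => {
      const parts = d.qualifiedName.split(".");
      return parts[parts.length - 1] === ref.targetName;
    });
    for (const candidate of candidates) {
      edges.push({
        from: ref.from,
        to: candidate.qualifiedName,
        ambiguous: candidates.length > 1,
      });
    }
  }
  return edges;
}

const edges = resolveByName(
  [{ qualifiedName: "X.baz" }, { qualifiedName: "Y.baz" }, { qualifiedName: "foo" }],
  [{ from: "foo", targetName: "baz" }]
);
```

The indexer itself only has to record the definitions and the unsolved references; the name matching happens afterwards.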

Right now the Typescript bindings for SourcetrailDB are located in a separate repository. We discussed this internally and we would really like to move them to the main SourcetrailDB repository. So if you want to, you could

  • fork this repo,
  • move the Typescript wrapper code to a branch on the fork and
  • add an option to the main CMakeLists file to enable/disable building the Typescript wrapper (exactly like it is done for the python bindings).
  • If you need help with anything, just open up an issue on your fork and we will try to assist as well as we can.
  • Once everything is set up we would be glad to accept your pull request and
  • merge the fork's branch back to this repository.

With that setup we would be able to keep the Typescript bindings up to date with all upcoming changes to the database.

@LouisStAmour

LouisStAmour commented Nov 4, 2019

Fantastic, thanks. I'll look at doing that, though I might not have time until later in the week.

In regards to the deep introspection, I'm pretty sure Babel's AST is relatively shallow, more like a parser, as it works file-by-file. It has the benefit of understanding the most real-world source code, though, because most code goes through Babel's transformations and sometimes Babel is used to transform code directly to what the browser will run, though usually there are additional steps.

TypeScript's compiler, on the other hand, goes through an entire project and builds out incredible detail on what symbols reference which types, where they were defined, etc., some of which is mentioned here https://levelup.gitconnected.com/writing-a-custom-typescript-ast-transformer-731e2b0b66e6 and so from an ideal perspective, you'd chain together Babel to TypeScript with source mappings to preserve the lookup from the transformed Babel code back to the original TS code pre-transformation. The catch is that if the symbols don't match one-to-one, there will likely be more code in the TypeScript version than was present in the initial Babel one. And given how many Babel transformations there are to support, it's possible folks might have to pick one or the other to start.

Typescript's compiler is not at all extensible, but it does have JSX support: https://www.typescriptlang.org/docs/handbook/jsx.html

Here's further documentation on the TypeScript Binder: https://basarat.gitbooks.io/typescript/docs/compiler/binder.html

I'll probably need to read and re-read the TypeScript binder docs to build the indexer, and so the first projects we'd support would be ones that the tsc command is set up for, whether TS or JS. (Despite the name, the TypeScript Compiler is great at improving JavaScript semantics for IDEs also...)

@LouisStAmour

LouisStAmour commented Nov 9, 2019

I figured the hard part was interfacing with the compiler rather than refactoring the node bindings to merge with this repo. So I got started on an indexer command line tool, using ts-command-line and some compiler code samples. No readme yet, as it's still set up for local development (package.json points to a local path for the sourcetraildb module) and it's not yet complete. But I figured I'd share where I'm at.

It looks like TypeScript's compiler prefers getting all project files at once rather than going one-by-one, and has a compiler mode where it's passed a single tsconfig.json path or goes traversing up parent directories until it finds one. The tsc compiler's options, including files to include and files to exclude, are all specified within tsconfig.json and if one isn't provided, one is effectively internally created from the command line options. Given this convention, I'm not yet sure it makes sense to map the indexer to the Sourcetrail "custom command" screen, as the custom command screen expects to be given a list of files but then executes them one by one. The alternative is to try and use the new incremental compile speedups, as if we were watching a project one file at a time, but that sounds somewhat risky. I'm also not sure how best to handle when typescript is found inside other project folders/types.
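
The upward tsconfig.json lookup can be sketched as a small pure function. This is only an illustration, assuming absolute POSIX-style paths and an injected `exists` predicate so that it does not touch the real file system; as far as I know the compiler API offers `ts.findConfigFile` for the real thing:

```typescript
// Walks from startDir towards the root and returns the first tsconfig.json
// for which `exists` returns true, or undefined if none is found.
// Assumes absolute, POSIX-style paths.
function findTsconfig(
  startDir: string,
  exists: (path: string) => boolean
): string | undefined {
  let dir = startDir.replace(/\/+$/, "") || "/";
  for (;;) {
    const candidate = dir === "/" ? "/tsconfig.json" : `${dir}/tsconfig.json`;
    if (exists(candidate)) {
      return candidate;
    }
    if (dir === "/") {
      return undefined;
    }
    const parent = dir.slice(0, dir.lastIndexOf("/"));
    dir = parent === "" ? "/" : parent;
  }
}

// Hypothetical project layout, used only for illustration.
const projectFiles = new Set(["/work/app/tsconfig.json", "/work/app/src/index.ts"]);
const found = findTsconfig("/work/app/src/components", (p) => projectFiles.has(p));
```

An indexer that is handed individual files by a custom command could use a lookup like this to find the project each file belongs to.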

Anyway, you can have a look at https://github.com/LouisStAmour/SourcetrailTypescriptIndexer/ for my progress. My goal by tonight is to get the basics done so I can run the indexer directly and have it produce a Sourcetrail DB for some todomvc examples: https://github.com/tastejs/todomvc After that, I'll worry about more advanced TS projects and helping refactor the code for future maintenance...

@LouisStAmour

LouisStAmour commented Nov 10, 2019

Okay. So ... I'm finding the model of Sourcetrail slightly hard to map to TypeScript/EcmaScript's model. Not impossible, just having to wrap my head around it.

I do wish the following graphics at https://www.sourcetrail.com/documentation/#Nodes were mapped to sourcetraildb representations more clearly. (all the way to the Graph Legend, etc.)

If I understand it correctly, in the following scenario:

image

I should have packages for "typescript", "@types/react", "csstype" (under a "node_modules" package?) and then ... obviously, files for each file. It's unclear to me whether I should use relative or absolute paths, though I suppose that will be obvious as I go. It'd be nice if I could add "folder" types, maybe. And keep node modules separate. The idea in JavaScript modules is that modules can be uniquely identified by the full file path they came from. And it looks like TypeScript's compiler automatically tracks symbols back to where they came from. Though I'm not sure if it automatically resolves source map files, probably not. It probably sticks to .d.ts files and falls back to the minified JS if it can interpret types at run-time. If I'd want to make it easier to display node_modules code or have it prefer to show the original source via sourcemaps, I'd have to do the math myself. If source maps aren't available, it's possible we've a reference to code that's next to impossible to display without "de-minification" to re-introduce line breaks to the JS.
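
As a sketch of the "packages under node_modules" idea, a path-based helper could derive the package name from a file path. The function name and the rule of using the last node_modules segment are assumptions for illustration:

```typescript
// Returns the npm package a file belongs to, based on the last "node_modules"
// segment in its path, or undefined for project-local files.
function packageNameFromPath(filePath: string): string | undefined {
  const segments = filePath.split(/[\\/]/);
  const index = segments.lastIndexOf("node_modules");
  const first = segments[index + 1];
  if (index === -1 || first === undefined) {
    return undefined;
  }
  // Scoped packages such as "@types/react" span two path segments.
  const second = segments[index + 2];
  if (first.startsWith("@") && second !== undefined) {
    return `${first}/${second}`;
  }
  return first;
}

const reactTypes = packageNameFromPath("/proj/node_modules/@types/react/index.d.ts");
const compiler = packageNameFromPath("/proj/node_modules/typescript/lib/typescript.d.ts");
const local = packageNameFromPath("/proj/src/app.tsx");
```

Files for which this returns a name could be grouped under a package node, while everything else stays keyed by its own file path.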

But ... I feel like I am missing something for the mapping of TS types to Sourcetrail types. Sometimes in typescript you've "internal" or non-exported symbols, and other times you've exported symbols. Making it more confusing, sometimes exported symbols are callable but "undocumented" because they're not exported into .d.ts files. And I can't help but feel like I'm just scratching the surface of the complexity here. I kind of wish some of this were more clearly possible to express in the SourcetrailDB API. For instance, I'd probably want to see which functions are exported and which ones aren't.

Then there are things that seem hard to map, TypeScript has union types... https://www.typescriptlang.org/docs/handbook/advanced-types.html But unlike CPP, union types don't have members, they aren't actually distinct in that sense: https://docs.microsoft.com/en-us/cpp/cpp/unions?view=vs-2019 I can probably narrow down what type something is based on the source, as TS does a lot of magic inference on its own, but I'm not sure where to include details like what type a variable is at that moment vs how it was defined originally?
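
A small illustration of the declared type versus the type "at that moment" (the names are made up for the example):

```typescript
// Declared type: a union. Unlike a C++ union, it has no members of its own.
type Id = string | number;

// Inside each branch the compiler narrows `id` to one constituent of the union.
function describeId(id: Id): string {
  if (typeof id === "string") {
    return `string of length ${id.length}`; // id is narrowed to string here
  }
  return `number ${id.toFixed(0)}`; // id is narrowed to number here
}

const fromString = describeId("abc");
const fromNumber = describeId(42);
```

The parameter is declared once as `Id`, yet each use site sees a narrower type, which is exactly the information that is hard to place in a single graph node.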

An object is inherently dynamic, and sometimes is used like a hash map or dictionary... confusingly, an array has similar properties and if you assign a non-integer to it, it automatically becomes an object (confusingly initialized as an Array!) I guess where I'm going with this is it can be near impossible to pin down a variable to a certain value in practice. A variable defined in one place will change as it's used elsewhere. So I'm not sure how we can best represent this in the UI. There's even an "EvolvingArray" type: https://stackoverflow.com/questions/52804806/how-does-typescript-infer-element-type-for-array-literals

Some of the oddness comes from Typescript having to handle the fact that it's a generic language (JS) underneath and TypeScript was designed to be pragmatic -- to assign types to a dynamic language leads to quite a bit of complexity. For instance, you can type a method signature as many times as you like, as long as there's a single implementation and the implementation's signature is flexible enough to handle every type given: https://www.typescriptlang.org/docs/handbook/functions.html#overloads I don't think this will cause a problem in practice, but it does make for odd reading at times.
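
For reference, a minimal overload example (function and variable names are made up):

```typescript
// Two overload signatures, one implementation whose signature covers both.
function parse(input: string): string[];
function parse(input: string[]): string[][];
function parse(input: string | string[]): string[] | string[][] {
  if (typeof input === "string") {
    return input.split(",");
  }
  return input.map((line) => line.split(","));
}

const one = parse("a,b");          // typed as string[]
const many = parse(["a,b", "c"]);  // typed as string[][]
```

For an indexer this means several declarations in the source belong to a single function symbol.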

The concept of .d.ts files can probably be compared to CPP header files, except they can't contain implementations, they're only declarations for code implemented elsewhere. Sometimes you've the .js source files on disk, sometimes they're mapped to runtime-provided APIs. (typescript/lib)

Given how important generics are to typescript, I'm wondering if I should instead be looking at a more complete C++ example: https://www.typescriptlang.org/docs/handbook/generics.html I'd assume I need to map TypeReference nodes from TS generics to TYPE_PARAMETER symbols?

Then you've the oddballs due to JS scoping. When you use var outside a function or forget to use var (in non-strict mode), you get a global. In older code, however, these are sometimes not used outside the file they're defined in, so while they're global, they're not meant to be (or are used as locals). I'm not sure then if I should actually mark these as global, or not. Or whether marking as global matters from a UI perspective? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/var

All I can say is, TypeScript has ... a lot of syntax to map: https://github.com/microsoft/TypeScript/blob/master/lib/typescript.d.ts#L78

And that's not even counting having to interpret flags, decorators, modifiers: https://github.com/microsoft/TypeScript/blob/master/lib/typescript.d.ts#L496

It does look like most of what I want to pay attention to are resolved TS symbols and symbol tables, but those are in turn attached to the nodes in the syntax tree, and themselves have a lot of flags: https://github.com/microsoft/TypeScript/blob/master/lib/typescript.d.ts#L2164 Also, Symbol interfaces or Symbol objects are not to be confused with UniqueESSymbolType https://github.com/microsoft/TypeScript/blob/master/lib/typescript.d.ts#L2346 or ES6 Symbols. This had me stumped for a while as I didn't catch the difference.

Also if all the above wasn't enough ... I don't think it affects anything in Sourcetrail yet, but you can use literals as types. So a string might be assigned the type "seven". It's still a string, but now it's a specific type of string. You can say "seven" | 7 if you also wanted to allow a number representation. The reason string literals matter: https://github.com/microsoft/TypeScript/blob/master/doc/spec.md#18-overloading-on-string-parameters Admittedly, in this example, there's a different type for HTMLSpanElement, but it's why they exist in the first place. They even came up with a way to make many types at the same time -- so you can use keyof to create a dynamic type from the keys of an object, and refer to the object later for the value, in this case of the return type: https://github.com/microsoft/TypeScript/blob/master/lib/lib.dom.d.ts#L4735
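
A compact sketch of literal types and keyof. `SpanLike`, `ImgLike`, `TagMap` and `createElementLike` are hypothetical stand-ins for the DOM types and the tag name map in lib.dom.d.ts:

```typescript
// A literal union: still a string or a number underneath, but only these two values.
type Seven = "seven" | 7;

// Hypothetical element types, standing in for HTMLSpanElement and friends.
interface SpanLike { kind: "span"; text: string }
interface ImgLike { kind: "img"; src: string }

// Maps tag names to element types, in the spirit of the DOM's tag name map.
interface TagMap {
  span: SpanLike;
  img: ImgLike;
}

// keyof TagMap is the literal union "span" | "img"; the return type follows the key.
function createElementLike<K extends keyof TagMap>(tag: K): TagMap[K] {
  const defaults: TagMap = {
    span: { kind: "span", text: "" },
    img: { kind: "img", src: "" },
  };
  return defaults[tag];
}

const seven: Seven = 7;
const spanElement = createElementLike("span"); // typed as SpanLike
const imgElement = createElementLike("img");   // typed as ImgLike
```

One map type replaces a whole family of string-literal overloads, and the compiler still knows the precise return type per call.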

Then there's namespacing, a thing unique to TypeScript which pre-dates modern ES6 module named import/export. It has some fun concepts too, like aliases: https://www.typescriptlang.org/docs/handbook/namespaces.html#aliases Note that an alias can be independently modified ... which is similar to how an object in JS also can be modified independent of its class. Should these possible object modifications make their own classes? How should they be exposed to Sourcetrail, or shown to users?

In conclusion -- I know I'm overthinking this. But there's a lot of syntax to take in, and I know if I were writing the UI from scratch, I'd want to present more of it to the user, particularly (ideally) what type an object or symbol is determined to be at that moment. But I'm not sure if the only way I can do that is to say it's re-declared when the type changes? Short of mapping JS back to V8 C++ primitives, I'm not sure how to model some of this...

@LouisStAmour

LouisStAmour commented Nov 10, 2019

And all this without mentioning that of course, you can have, and regularly have, multiple types of syntax and files in the same JS/TS project -- sometimes even in the same file -- including JSON, JavaScript, CSS, HTML, XML, TypeScript, JSX/TSX, Markdown, SCSS, and references to assets (images, fonts, web links). At various points you could make a graph from CSS class references in HTML and JS to CSS selector references back to HTML and JS, etc. And that's not to mention all the git history we're probably ignoring right now. ;-)

I don't know how in-depth Sourcetrail could be customized, but it'd be nice to organize the source by my own graph, such as by router routes in an MVC framework or React-Router, or ideally, to start from the app server initialization and HTML serving, and move straight into CSS import graphs, JS graphs, through frameworks like React and Redux to window.fetch/XMLHttpRequests and API calls, exposed API endpoints, then back-end code, SQL statements, to the database model. That would, of course, be the "holy grail" of visualizations, and probably at the start, only available to programmers looking at an AST and trying to make automated sense of a restricted syntax used in a particular codebase... I'm not saying it would be required here, but it would be a "nice-to-have" if the existing AST parsers were opened up for further customization, perhaps through a walker API and an extended SDK? Seems like an alternative might be https://docs.microsoft.com/en-us/visualstudio/modeling/directed-graph-markup-language-dgml-reference?view=vs-2019 but as I don't have VS Enterprise, I haven't extensively played with this at all. And it feels like we're getting dangerously close to OWL/RDF/ontology territory here... GraphML is also probably similar.

Finally, I'm not looking forward to writing up the syntax highlighting. Ideally, we'd have taken a VS Code approach and reused TextMate bundles (iirc) since they're very common and relatively portable, though not always "smart". The only other option I can think of would be to integrate an LSP and ideally it would provide syntax highlighting. I'm not sure how "smart" these get either.

@mlangkabel
Contributor

mlangkabel commented Nov 10, 2019

Hi @LouisStAmour, thank you very much for the update. It took me a while to get through all of the things that you mentioned but here is my answer and advice. I hope I haven't missed too many of the things you addressed.

When we started out working on the Python indexer we thought it would be difficult because of the dynamic nature of Python, but tackling TypeScript seems to take "difficult" to a whole new level.

Yes, you are overthinking this ;) Don't try to focus too much on the language and on how the language is read by the compiler. Instead try to think about how the code is read by a user and about what will be useful to know for the user.

  • For example you were asking about generics. For the compiler it is surely essential to know about <T> and where it is replaced with what type. When we started working on Sourcetrail for C++, we thought the same way. But for the next Sourcetrail release we will change that behavior and only record a usage of a local symbol for occurrences of T, because for the user who reads code with Sourcetrail, T is only relevant within the context of the generic function (or generic type), so he doesn't need to see a graph node for T.
  • The same is true for unions. Try to think about what abstraction level will be useful for the user. Does the database really need to keep track of function a using type c via a union b? Or is it enough that the user sees that function a uses type c?
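
As a sketch of that abstraction level for unions, an indexer could flatten a union to the named types it mentions and record plain usage edges. The `TypeRef` model and the names below are hypothetical and much simpler than the compiler's own types:

```typescript
// Hypothetical, simplified type model; not the TypeScript compiler's own types.
type TypeRef =
  | { kind: "named"; name: string }
  | { kind: "union"; members: TypeRef[] };

// Flattens (possibly nested) unions to the named types they mention, so that
// "function a uses type c via union b" is recorded as "a uses c".
function namedTypes(type: TypeRef): string[] {
  if (type.kind === "named") {
    return [type.name];
  }
  const names: string[] = [];
  for (const member of type.members) {
    for (const memberName of namedTypes(member)) {
      if (names.indexOf(memberName) === -1) {
        names.push(memberName);
      }
    }
  }
  return names;
}

// Parameter type of a hypothetical function "area": Circle | (Square | Circle)
const paramType: TypeRef = {
  kind: "union",
  members: [
    { kind: "named", name: "Circle" },
    {
      kind: "union",
      members: [
        { kind: "named", name: "Square" },
        { kind: "named", name: "Circle" },
      ],
    },
  ],
};

const usageEdges = namedTypes(paramType).map((to) => ({ from: "area", to }));
```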

If you haven't read it already, make sure to read our language extension guide. Maybe it helps to answer some of your open questions. In general:

  • go from "easy to implement" (e.g. record static symbol definitions) to "hard to implement" (e.g. infer a call to a method of a type based on the keyof keyword)
  • if you cannot resolve the name of a callee yet, just record a call to an unsolved symbol
    (the API has a method for that). In the Sourcetrail preferences you can activate the Add ambiguous edges for unsolved references setting. Let's look at an example:
class X { baz():void {} }
class Y { baz():void {} }

function foo(){
	return bar.baz();
}
  • With that setting active, and a call to an unsolved symbol recorded at the location of bar.[baz]() ([ and ] indicates the source location that is recorded for the call edge), Sourcetrail would add two edges: one from foo to X.baz() and one from foo to Y.baz() and mark both edges ambiguous (they are getting displayed with dashed lines in the graph).
    At a later point in time, once you have figured out how to resolve which baz() method is actually called, you can implement that in your indexer and remove the unsolved edge.
  • Try to map all the kinds of nodes that you need for typescript to the node kinds provided by SourcetrailDB at first. E.g. of course having a module node kind would be nice, but
    • in the short term, just record modules with node type package.
    • in the mid term, create a new issue here on the SourcetrailDB repository where you request adding a module type
    • the long term plan is that SourcetrailDB gets rid of the fixed node kinds and provides a registerNodeKind() function that allows configuring node kinds as desired.
  • You can use the node kinds provided by SourcetrailDB however you want. But I would recommend that you record local symbols for variables that are only valid inside a scope and record global variable nodes for variables that can be accessed from different scopes. Knowing the location and usages of local symbols only matters when really reading the code, so they should only get highlighted in Sourcetrail's code view whereas global variables produce a clickable node in the graph that can be connected to different symbols (e.g. functions).
  • If different TS versions interpret the code differently, your options would be to limit your TS indexer to a certain version of TS or to add a command line flag to your indexer that allows specifying the TS version.
  • My advice for namespaces:
    • For namespaces just record a namespace node.
    • For aliases just record a new namespace node. For each location where the alias is assigned, add an edge that goes from the alias node to the node of the aliased namespace.
  • Exported types: Currently we cannot add custom information like this to a node. For the future, we could implement adding tags to nodes, so that searching for the tag within Sourcetrail would show a graph containing all nodes with this tag.
  • Syntax highlighting can already be customized with rules (take a look at the data\syntax_highlighting_rules folder within Sourcetrail's install directory)
  • For the future we are planning to open up the UI a little bit more to developers of Sourcetrail language packages. The way to create custom node kinds that I mentioned above would be a start. But some kind of graph markup language like the one you linked that can be modified by a language package would be really nice.
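
The mapping advice above can be sketched as a small classification function. The string kinds and the `Declaration` shape are hypothetical stand-ins; the real bindings expose their own constants:

```typescript
// String stand-ins for node kinds; the real bindings expose their own constants.
type NodeKind = "package" | "namespace" | "class" | "function" | "global_variable";
type Recorded = { as: "node"; kind: NodeKind } | { as: "local_symbol" };

// Hypothetical description of a TypeScript declaration found by the indexer.
interface Declaration {
  syntax: "module" | "namespace" | "namespace_alias" | "class" | "function" | "variable";
  // True when the declaration is only visible inside a function or block scope.
  scopedLocally: boolean;
}

function classify(decl: Declaration): Recorded {
  switch (decl.syntax) {
    case "module":
      // Short-term stand-in until a dedicated module kind exists.
      return { as: "node", kind: "package" };
    case "namespace":
    case "namespace_alias":
      return { as: "node", kind: "namespace" };
    case "class":
      return { as: "node", kind: "class" };
    case "function":
      return { as: "node", kind: "function" };
    case "variable":
      return decl.scopedLocally
        ? { as: "local_symbol" }
        : { as: "node", kind: "global_variable" };
  }
}

const moduleRecord = classify({ syntax: "module", scopedLocally: false });
const loopVariable = classify({ syntax: "variable", scopedLocally: true });
```

Keeping the mapping in one place should make it easy to swap "package" for a proper module kind once one is available.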

@LouisStAmour

LouisStAmour commented Nov 10, 2019

Thanks for your quick reply! Letting me know the difference between globals and locals really helps, it’s the kind of info I need to try and understand how and why I would pick one description over another.

https://github.com/microsoft/TypeScript/wiki/Architectural-Overview Is quite useful too. I’m surprised it took me this long to find it and that I got so far without it. I’m going to have a look at the language server to see if it has any good examples on converting from AST to something more human-readable and useful.

I’m also going to focus much more on the symbol attribute attached to nodes — it appears to be the symbolic representation used by TS to determine whether types match and are appropriate. So it will likely give us the most information on what an AST node actually means. Secondarily to that are reporting types, and as they can get complicated, I’ll try to narrow the type just like TS does, to the best guess of what type the variable is at that moment. I’m still not sure if to represent union types I’ll have to report a variable as more than one type, somehow, but I will assume for now the API is flexible enough to handle whatever I need to toss at it.

It’s a good reminder to report just the information you would want to know if someone pointed at a node in the AST and asked “what’s this?” The idea that you want to focus less on <T> makes sense. I can just imagine that a lookup for <T> alone is near useless and muddies the water when looking for usages of T. An ideal approach would be looking for the top-level generic and, if there are too many results, filtering the list by <T>. Similarly, if looking for T, you could narrow further by T[], or Array<T>, or [T] ... yes, those are all 3 ways of specifying an array of T in TypeScript. The first two are interchangeable. The last is most specific because it expects a size of 1, only containing T.
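
As a small illustration of the three spellings mentioned above (the variable and function names are made up for this example):

```typescript
// T[] and Array<T> are two spellings of the same type.
const viaBrackets: number[] = [1, 2, 3];
const viaGeneric: Array<number> = viaBrackets; // assignable without any conversion

// [T] is a tuple type: exactly one element of type T.
const single: [number] = [42];

// A one-element tuple can be widened to an array type, but not the reverse.
const widened: number[] = single;
// const narrowed: [number] = viaBrackets; // compile error

// A generic function over T[] accepts all of the values above.
function countItems<T>(items: T[]): number {
  return items.length;
}
```

For an indexer this suggests that `T[]` and `Array<T>` should end up as the same type usage, while `[T]` is a separate tuple type.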

@LouisStAmour

LouisStAmour commented Nov 11, 2019

Well, I figured I'd go back to basics and try to figure out Sourcetrail's ERD, how the various "kinds" relate to one another and to the graph's representation. Ended up with https://github.com/LouisStAmour/SourcetrailTypescriptIndexer/blob/master/docs/SourcetrailDB%20Schema.pdf where, if I were to summarize, a table named "element" is used to track unique IDs for pretty much every other table, except that elements have many components (ambiguous markers/internal data?) and elements have many source_locations (Source Range + Location Kind), while a source location can be mapped to one or more elements. What is an element then? The primary kinds are nodes and edges. Nodes are files or symbols, edges connect two nodes. Nodes are unique in that they can be named, while edges are distinguished by their types, sources and targets. Multiple edges of different types can connect the same nodes (symbols or files). Files have contents, symbols have kinds and definition kinds, and edges have reference kinds. Special types of edges include membership and aggregation, haven't traced these back to the relevant calls yet. Actually, my next goal was to look at which kinds make sense for TypeScript and which don't, as well as to map some of the API calls to table inserts. Other types of elements besides nodes and edges are errors and local_symbols, which have source locations, and possibly ambiguous, but are otherwise detached from the graph. (That's right, a local_symbol is not a symbol.) The source file for the graph is the graphml, edited/created interactively using yEd, based on a graphml file created in DBeaver.
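
To make the summary above concrete, here is a rough in-memory model of those relationships. It is only a sketch of the description in this comment: `ElementStore`, `GraphNode`, `GraphEdge`, `SourceLoc` and `Occurrence` are hypothetical names, and the real database layout may differ in its details.

```typescript
// Hypothetical in-memory model of the schema described above.
// None of these names are SourcetrailDB API calls.
interface GraphNode { id: number; serializedName: string; }
interface GraphEdge { id: number; kind: string; sourceId: number; targetId: number; }
interface SourceLoc { fileNodeId: number; startLine: number; endLine: number; locationKind: string; }
interface Occurrence { elementId: number; location: SourceLoc; }

class ElementStore {
  private nextElementId = 1; // the "element" table: one ID sequence shared by nodes and edges
  nodes: GraphNode[] = [];
  edges: GraphEdge[] = [];
  occurrences: Occurrence[] = [];

  // Nodes (files or symbols) are identified by their name.
  addNode(serializedName: string): GraphNode {
    for (const existing of this.nodes) {
      if (existing.serializedName === serializedName) return existing;
    }
    const node: GraphNode = { id: this.nextElementId++, serializedName: serializedName };
    this.nodes.push(node);
    return node;
  }

  // Edges are identified by their kind, source and target.
  addEdge(kind: string, source: GraphNode, target: GraphNode): GraphEdge {
    for (const existing of this.edges) {
      if (existing.kind === kind && existing.sourceId === source.id && existing.targetId === target.id) {
        return existing;
      }
    }
    const edge: GraphEdge = { id: this.nextElementId++, kind: kind, sourceId: source.id, targetId: target.id };
    this.edges.push(edge);
    return edge;
  }

  // One element can have many locations, and one location can belong to several elements.
  addOccurrence(elementId: number, location: SourceLoc): void {
    this.occurrences.push({ elementId: elementId, location: location });
  }
}

const store = new ElementStore();
const caller = store.addNode("api.MyType.my_method");
const callee = store.addNode("Client.send_signal");
const callEdge = store.addEdge("CALL", caller, callee);
const usageEdge = store.addEdge("USAGE", caller, callee); // same nodes, different edge kind
```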

@mlangkabel
Contributor

mlangkabel commented Nov 11, 2019

Wow, this is awesome. Everything you deduced is correct. Here is some more information:

  • The meta table is only used to store meta data to the database (e.g. database version for Sourcetrail's compatibility checks)
  • The access table is used to store the access modifier of a class' member (e.g. public, private, protected, default) and is not yet part of SourcetrailDB.
  • The element_component table is currently only used for ambiguous markers (as you expected), but in the future shall also store the content of the access table.
  • member edges connect a class to its members (methods and fields). These will be created automatically by SourcetrailDB when you record a symbol with a multi-level name hierarchy and are useful to speed up UI interactions for Sourcetrail.
  • aggregation edges are actually not stored to the database. They are only used in Sourcetrail's graph to aggregate multiple edges between two symbols to remove clutter. They should be removed from the SourcetrailDB interface. I will create an issue for that.
  • local symbols are not part of the element table to keep the size of that table down. One could argue that local symbols and errors should also be elements because they also have source locations, but for performance reasons we decided on the current approach.
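
The idea behind the automatically created member edges can be sketched as follows. How SourcetrailDB implements this internally is not shown in this thread, so `MemberEdge` and `memberEdgesFor` are hypothetical names used only to illustrate "one member edge per level of the name hierarchy".

```typescript
// Hypothetical illustration; not SourcetrailDB code.
interface MemberEdge {
  parent: string;
  child: string;
}

// For a name hierarchy such as ["api", "MyType", "my_method"], derive one
// member edge for every parent/child pair along the hierarchy.
function memberEdgesFor(nameHierarchy: string[], delimiter: string): MemberEdge[] {
  const result: MemberEdge[] = [];
  for (let i = 1; i < nameHierarchy.length; i++) {
    result.push({
      parent: nameHierarchy.slice(0, i).join(delimiter),
      child: nameHierarchy.slice(0, i + 1).join(delimiter),
    });
  }
  return result;
}

const memberEdges = memberEdgesFor(["api", "MyType", "my_method"], ".");
// memberEdges[0]: api -> api.MyType
// memberEdges[1]: api.MyType -> api.MyType.my_method
```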

@LouisStAmour

Two things I've noted, before I go back to my, um, visual modelling, is that 1. I noticed after a "clear" that while most of my tables have cleared the element numbering shows 73 to 90, and the element table has entries from 1 to 90. And 2. local symbols and errors might not be intentionally part of the table, but it certainly creates a relationship: https://github.com/CoatiSoftware/SourcetrailDB/blob/master/core/src/DatabaseStorage.cpp#L479 and https://github.com/CoatiSoftware/SourcetrailDB/blob/master/core/src/DatabaseStorage.cpp#L524

@mlangkabel
Contributor

Oh right, sorry for the misinformation. I still remembered an older state of the database. Looks like we already fixed that (previously, the error table and the local_symbol table stored location information on their own).

@LouisStAmour

LouisStAmour commented Nov 11, 2019

https://github.com/LouisStAmour/SourcetrailTypescriptIndexer/blob/master/docs/SourcetrailDB%20Schema.pdf

Okay, I've updated the above diagram with TypeScript API calls. Might have to double-count to make sure I didn't miss any. Looks like some of the Location Kinds aren't possible to set (yet), such as "fulltext search" or "screen search"? Presumably these are used internally, perhaps for screen state restoration, or as a form of prioritization.

I'll say that compared to the other calls, it feels a bit odd that there are so many location API calls. Given the pattern of the others, it might be nicer to have a SymbolLocationKind class and be able to pass in Token (presumably), Scope, Qualifier, or Signature. Not a big deal, once you work it out mentally.
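
A rough sketch of what such a SymbolLocationKind-based call could look like. Everything here is hypothetical: `PerKindRecorder` is a stand-in for the existing one-call-per-kind style (its method names are assumptions), and `recordLocation` is the proposed single entry point, not an existing function.

```typescript
interface SourceRange {
  fileId: number;
  startLine: number;
  startColumn: number;
  endLine: number;
  endColumn: number;
}

enum SymbolLocationKind { Token, Scope, Qualifier, Signature }

// Stand-in for the existing style with one call per location kind.
// The method names are assumptions made for this sketch.
interface PerKindRecorder {
  recordSymbolLocation(symbolId: number, range: SourceRange): void;
  recordSymbolScopeLocation(symbolId: number, range: SourceRange): void;
  recordQualifierLocation(symbolId: number, range: SourceRange): void;
  recordSymbolSignatureLocation(symbolId: number, range: SourceRange): void;
}

// Proposed single entry point that dispatches on the location kind.
function recordLocation(
  recorder: PerKindRecorder,
  symbolId: number,
  range: SourceRange,
  kind: SymbolLocationKind
): void {
  switch (kind) {
    case SymbolLocationKind.Token:
      recorder.recordSymbolLocation(symbolId, range);
      break;
    case SymbolLocationKind.Scope:
      recorder.recordSymbolScopeLocation(symbolId, range);
      break;
    case SymbolLocationKind.Qualifier:
      recorder.recordQualifierLocation(symbolId, range);
      break;
    case SymbolLocationKind.Signature:
      recorder.recordSymbolSignatureLocation(symbolId, range);
      break;
  }
}

// Fake recorder that only logs which call was made, to show the dispatch.
const callLog: string[] = [];
const loggingRecorder: PerKindRecorder = {
  recordSymbolLocation: (id) => { callLog.push("token:" + id); },
  recordSymbolScopeLocation: (id) => { callLog.push("scope:" + id); },
  recordQualifierLocation: (id) => { callLog.push("qualifier:" + id); },
  recordSymbolSignatureLocation: (id) => { callLog.push("signature:" + id); },
};

const exampleRange: SourceRange = { fileId: 1, startLine: 8, startColumn: 11, endLine: 8, endColumn: 13 };
recordLocation(loggingRecorder, 7, exampleRange, SymbolLocationKind.Token);
recordLocation(loggingRecorder, 7, exampleRange, SymbolLocationKind.Scope);
```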

The real odd one is:

recordReferenceToUnsolvedSymhol(
   contextSymbolId: SymbolId,
   referenceKind: ReferenceKind,
   location: SourceRange
 ): ReferenceId;

It's like a recordReference and a recordReferenceLocation all at the same time. Seems unnecessarily complicated. Or is there a reason?

From my perspective, you record a reference, then record the reference location. Actually, maybe I'm thinking of this backwards. Perhaps a better question is -- when is location not required or not desired to be tracked? It would seem to me that 99% of the time, you'd want to track the location of everything added to the system. Because location is attached to element and everything we track is an element, it follows that everything has a location. The only things which might not need a location for obvious reasons are files, and it might follow that the only reason files are nodes is because in some languages, there might be a strong one-to-one mapping between -- actually wait, no, it doesn't make sense. Why are files considered nodes? Ah. Because you can have edges between files, or references from one file to another. Though ... you can't set that up in the API, so that files are a type of node is an implementation detail as far as the ERD is concerned. Okay, let me see if I can make that clearer in my diagram somehow.

Edit: I've updated the diagram so files is now closer to source location, and ambiguous references live closer to edges, and re-arranged spacing so reference locations live between references and source locations, without overlapping lines. Ideally, the symbol location APIs would live between symbols and source locations, but it's confusing if I maintain the database schema mapping of files to nodes when the API layer breaks that and only maps files to source locations.

Second edit: I've updated the diagram a second time to move symbol source location APIs between symbols and source locations. And manually touched up the label placement for ERD edges.

@LouisStAmour

Now I only need to make sure I understand how the different kinds map to the Sourcetrail UI, again. I've a better idea why you record certain things -- the API is meant to be a bit lazy, in that you'd probably try to record as little as possible. And while you can say a file imports another file, it's not supported by API documentation. So ... I'm not sure how I should handle import * as ts from "typescript" or import "typescript". You're referencing an entire file... Ah, I see. I suppose I'd have a one-to-one mapping between files and a symbol representing the whole file as, say, a module. If files are modules, then I guess that makes folders simply part of the name element, with a name delimiter of "/". I'd probably need to look for node_modules folders though, to create package symbols for each node_modules folder. And this way I can reserve namespaces for Typescript namespaces, which have a specific meaning. Okay, I think I can see how this would work. I'll try again with the typescript in a bit. Not sure yet if I'll have time tomorrow, might have to wait a few days to a week...
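
A minimal sketch of the file-to-module mapping described above, assuming POSIX-style paths. `ModuleName` and `moduleNameFor` are hypothetical names, and treating the folder right after `node_modules` as the package is an assumption of this sketch.

```typescript
interface ModuleName {
  delimiter: string;
  elements: string[];         // folders plus the file name without its extension
  packageName: string | null; // set when the file lives under a node_modules folder
}

function moduleNameFor(filePath: string): ModuleName {
  const parts = filePath.split("/").filter(function (part) { return part.length > 0; });

  // The folder after the innermost node_modules is taken as the package name;
  // scoped packages look like node_modules/@scope/name.
  let packageName: string | null = null;
  const idx = parts.lastIndexOf("node_modules");
  if (idx >= 0 && idx + 1 < parts.length) {
    const first = parts[idx + 1];
    const scoped = first.charAt(0) === "@" && idx + 2 < parts.length;
    packageName = scoped ? first + "/" + parts[idx + 2] : first;
  }

  // Drop .ts, .tsx, .d.ts or .d.tsx from the last element so it reads like a module name.
  if (parts.length > 0) {
    const lastIndex = parts.length - 1;
    parts[lastIndex] = parts[lastIndex].replace(/(\.d)?\.tsx?$/, "");
  }

  return { delimiter: "/", elements: parts, packageName: packageName };
}

const localModule = moduleNameFor("/proj/src/util/strings.ts");
const libraryModule = moduleNameFor("/proj/node_modules/typescript/lib/typescript.d.ts");
```

With this mapping, `import * as ts from "typescript"` could be recorded as a reference from the importing file's module symbol to the module symbol of the resolved file, leaving namespace nodes free for TypeScript namespaces.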

@LouisStAmour

LouisStAmour commented Nov 16, 2019

As an update, the typescript indexer is underway, I've got it looping through files, but I was getting bogged down in all the error-checking statements and having to maintain so many temporary IDs, so I decided to clean up the TypeScript API, and came up with a builder syntax: https://github.com/LouisStAmour/sourcetraildb/blob/master/src/builder.ts (This might not be final, I'll have to add documentation and may clean it up further as I use it.)

Here's the cpp_example.ts converted to the new builder syntax, which also throws exceptions, but takes advantage of this to return itself and allow for easy object and property declaration through chaining:

(Edit: Re-reading the builder syntax, the part that concerns me most is the createSymbol parameters. They’re not reading quite like English to me, but I’m not yet sure if the extra complexity of having another builder and an extra function call or two is worth it.)

import SourcetrailDB, { ReferenceKind, SymbolKind } from "./src/builder";

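// Assumption: both paths are passed on the command line (not shown in the original snippet)
const [dbPath, sourcePath] = process.argv.slice(2);

// open database by passing .srctrldb or .srctrldb_tmp path
SourcetrailDB.openAndClear(dbPath, writer => {
  console.log("Starting Indexing...");
  // record source file by passing its absolute path
  const file = writer.createFile(sourcePath).asLanguage("cpp"); // for syntax highlighting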

  // record atomic source range for multi line comment
  writer.recordAtomicSourceRange(file.at(2, 1, 6, 3));

  // record namespace "api"
  const namespace = writer
    .createSymbol(".", "api")
    .explicitly()
    .ofType(SymbolKind.NAMESPACE)
    .atLocation(file.at(8, 11, 8, 13))
    .withScope(file.at(8, 1, 24, 1));

  // record class "MyType"
  const className = namespace
    .createChildSymbol("MyType")
    .explicitly()
    .ofType(SymbolKind.CLASS)
    .atLocation(file.at(11, 7, 11, 12))
    .withScope(file.at(11, 1, 22, 1)); // gets highlighted when active

  // record inheritance reference to "BaseType"
  writer
    .createSymbol(".", "BaseType")
    .isReferencedBy(className, ReferenceKind.INHERITANCE)
    .atLocation(file.at(12, 14, 12, 21));

  // add child method "void my_method() const"
  const method = className
    .createChildSymbol("void", "my_method", "() const")
    .explicitly()
    .ofType(SymbolKind.METHOD)
    .atLocation(file.at(15, 10, 15, 18))
    .withScope(file.at(15, 5, 21, 5)) // gets highlighted when active
    .withSignature(file.at(15, 5, 15, 45)); // used in tooltip

  // record usage of parameter type "bool"
  writer
    .createSymbol(".", "bool")
    .isReferencedBy(method, ReferenceKind.TYPE_USAGE)
    .atLocation(file.at(15, 20, 15, 23));

  // record parameter "do_send_signal"
  writer
    .createLocalSymbol("do_send_signal")
    .atLocation(file.at(15, 25, 15, 38))
    .atLocation(file.at(17, 13, 17, 26));

  // record source range of "Client" as qualifier location
  const qualifier = writer
    .createSymbol(".", "Client")
    .withQualifier(file.at(19, 13, 19, 18));

  // record function call reference to "send_signal()"
  qualifier
    .createChildSymbol("", "send_signal", "()")
    .ofType(SymbolKind.FUNCTION)
    .isReferencedBy(method, ReferenceKind.CALL)
    .atLocation(file.at(19, 21, 19, 31));

  // record error
  writer.recordError(
    'Really? You missed that ";" again? (intentional error)',
    file.at(22, 1, 22, 1)
  );
});

console.log("done!");
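
For readers unfamiliar with the pattern, the chaining used above relies on each method returning the builder itself and on errors being thrown instead of returned as status codes. The following is a stripped-down sketch of that idea only. `SketchSymbolBuilder` is a hypothetical class and does not reproduce the linked builder.ts.

```typescript
// Hypothetical, minimal fluent builder; it records into memory instead of a database.
class SketchSymbolBuilder {
  kind: string | null = null;
  locations: string[] = [];

  constructor(readonly fullName: string) {
    // Throwing here replaces the per-call error checks of a status-code API.
    if (fullName.length === 0) {
      throw new Error("symbol name must not be empty");
    }
  }

  // Each setter returns `this`, which is what makes chaining possible.
  ofType(kind: string): this {
    this.kind = kind;
    return this;
  }

  atLocation(range: string): this {
    this.locations.push(range);
    return this;
  }

  // Child symbols extend the parent's name, so no temporary IDs have to be passed around.
  createChildSymbol(childName: string): SketchSymbolBuilder {
    return new SketchSymbolBuilder(this.fullName + "." + childName);
  }
}

const apiNamespace = new SketchSymbolBuilder("api").ofType("NAMESPACE").atLocation("8:11-8:13");
const myType = apiNamespace.createChildSymbol("MyType").ofType("CLASS").atLocation("11:7-11:12");
```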

@LouisStAmour

LouisStAmour commented Nov 18, 2019

This little project is teaching me so much I didn't know or had forgotten about JavaScript. For instance, I was looking up why I couldn't refer to a name for my classes in a certain location and discovered: class expressions! https://www.typescriptlang.org/docs/handbook/release-notes/typescript-1-6.html#class-expressions

In a class expression, the class name is optional and, if specified, is only in scope in the class expression itself. This is similar to the optional name of a function expression. It is not possible to refer to the class instance type of a class expression outside the class expression, but the type can of course be matched structurally.

let Point = class {
    constructor(public x: number, public y: number) { }
    public length() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    }
};
var p = new Point(3, 4);  // p has anonymous class type
console.log(p.length());

At times like these, I wonder if an AST is simply more work than a simple graph of symbols to declarations... In this case, the class doesn't have a name, the variable the class is assigned to does. What the how in the .... I'm reminded of anonymous functions. Or anonymous classes in Java.

Searching around, I found this for C++ CoatiSoftware/Sourcetrail#189 which highlights that a user-friendly way to handle this is to add the members to the graph, as if they belong to the parent or assigned variable as members. That makes sense to me, I suppose in the graph it's easier to think of an anonymous class as a class with a "namespace" or a class with a variable name that isn't global to the current scope. It ... would complicate variable assignment, I suppose. Sometimes a variable assignment would be a class definition. Joy. :)
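
One possible way to express that naming rule, as a sketch only: `ClassExpressionInfo`, `displayNameFor` and `memberHierarchies` are hypothetical names, and preferring the assigned variable over the class's own optional name is an assumption based on the approach described above.

```typescript
interface ClassExpressionInfo {
  ownName: string | null;    // optional name inside the class expression itself
  assignedTo: string | null; // variable the class expression is assigned to, if any
  memberNames: string[];
}

// Prefer the variable name, because the class's own name is only in scope
// inside the class expression; fall back to a placeholder when neither exists.
function displayNameFor(info: ClassExpressionInfo): string {
  if (info.assignedTo !== null) return info.assignedTo;
  if (info.ownName !== null) return info.ownName;
  return "<anonymous class>";
}

// Members are recorded as if they belonged to the assigned variable.
function memberHierarchies(info: ClassExpressionInfo, delimiter: string): string[] {
  const owner = displayNameFor(info);
  return info.memberNames.map(function (member) { return owner + delimiter + member; });
}

const pointClass: ClassExpressionInfo = {
  ownName: null,
  assignedTo: "Point",
  memberNames: ["x", "y", "length"],
};
const pointMembers = memberHierarchies(pointClass, ".");
```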

@LouisStAmour

[image: Sourcetrail screenshot]

Okay, so I'm making progress, but it feels like I'm doing it the hardest way possible, pretty much because I don't know any better yet. I've copied over the code for the language server when it's building an outline of the source code for a file. But I've replaced each call to build the outline with a debugger breakpoint, and as the AST walker hits each breakpoint, I manually inspect the match to figure out what to do.

So far, I haven't left the global definitions of lib d.ts files, so I've restricted my matches currently to such files where I find them, as I'm creating a lot of GLOBAL_VARIABLES but in the process, I feel like my graph isn't as rich as it could be. And once I'm done with globals, I'll have to hit a point where I'm creating node names that mix both filenames ("/" delimiters) and JS/TS variable scope ("." delimiters) and I'm not yet sure how I'll handle that: these exist in packages/folders which might be nice to navigate if I hard-code one level, but if I use multi-level, or write in the filename to the scope or name of the variable, well, normally in TS we don't consider the files to be modules -- we don't display filenames as part of variable names -- but we treat them like that when editing. If not in ES6 modules mode, then we'd probably fall back to globals or local function scope. The way we tend to think of it is "symbolName defined in file path" the same way you'd "import (name or alias) from (file)". There's also a difference between an exported symbol and one that isn't. So it would make sense that only exported symbols are scoped to the file, while other symbols are considered local. But local symbols aren't rich enough to be part of the graph -- local symbols can't have edges, you can't record calls to local symbols, they're not nodes. So if an exported function is complicated enough to call multiple local functions, and maybe a local class, there's no easy way to define this in Sourcetrail right now outside of adding extra noise to the Name Hierarchy in the form of file paths to make TS symbols unique to the files they're defined in...

@LouisStAmour

An update, now that Sourcetrail's open source, this should speed up the work I was doing above, as I'll have a better idea of how this fits in, and maybe I can suggest parts of a PR to better support TypeScript in Sourcetrail's UI/graph types. While I do like the consistency of the existing types, it's worth trying to import a slightly more complicated graph and see how it plays out visually, where the seams are that need to be ironed out.

@mlangkabel
Contributor

Sorry for keeping quiet for a while. We had a lot of work going on with open-sourcing Sourcetrail.

I don't know what you have decided to do now, but from what it sounds like: if you have a way of indexing TS code that reads in multiple files at once, use that way. That's the way that sounds at least like it is capable of solving relations between symbols.

Looking at the screenshot from above: I'm not familiar with TS, but to me the code just looks like it is only definitions. No usages, calls, etc there. So it is just natural that the graph looks this way ;)

Maybe just take a look at our Java indexer (it is much easier to understand than the C++ indexer). In the Java indexer the AstVisitor does all the work.

For example, take a look at boolean AstVisitor.visit(SimpleName node). Here we just get an object of type SimpleName from the AST (that only describes the original identifier (let's call it foo) that was used in the code and the location where it was used) and call

IBinding binding = node.resolveBinding();

This "converts" the AST node to a VariableBinding that we can then ask for the variable's definition.
And then we record a usage of the local symbol foo at the location of the SimpleName (e.g. we call getRange() on node but toNameHierarchy() on declName, which is how we get the original name of the used decl):

m_client.recordLocalSymbol(declName.toNameHierarchy(), getRange(node));
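
The same "name from the declaration, location from the usage" idea can be shown with a toy TypeScript model. This deliberately does not use the TypeScript compiler API: `DeclInfo`, `LexicalScope`, `IdentifierUse`, `resolveBinding` and `recordUsage` are hypothetical names, and a real indexer would get the binding from the type checker instead.

```typescript
interface DeclInfo { declName: string; line: number; column: number; }
interface LexicalScope { declarations: DeclInfo[]; parent: LexicalScope | null; }
interface IdentifierUse { text: string; line: number; column: number; }
interface LocalSymbolRecord { symbolName: string; line: number; column: number; }

// Toy counterpart of resolveBinding(): walk outward through the scopes
// until a declaration with the identifier's name is found.
function resolveBinding(scope: LexicalScope | null, text: string): DeclInfo | null {
  for (let current = scope; current !== null; current = current.parent) {
    for (const decl of current.declarations) {
      if (decl.declName === text) return decl;
    }
  }
  return null;
}

// The symbol name is derived from the declaration (so shadowed variables stay
// distinct), while the recorded location is that of the usage.
function recordUsage(scope: LexicalScope, use: IdentifierUse): LocalSymbolRecord | null {
  const decl = resolveBinding(scope, use.text);
  if (decl === null) return null;
  return {
    symbolName: decl.declName + "@" + decl.line + ":" + decl.column,
    line: use.line,
    column: use.column,
  };
}

const outerScope: LexicalScope = { declarations: [{ declName: "foo", line: 1, column: 5 }], parent: null };
const innerScope: LexicalScope = { declarations: [{ declName: "foo", line: 3, column: 9 }], parent: outerScope };

const innerUsage = recordUsage(innerScope, { text: "foo", line: 4, column: 3 });
const outerUsage = recordUsage(outerScope, { text: "foo", line: 6, column: 1 });
```

Here the two usages of `foo` end up as two different local symbols because they resolve to different declarations.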

Regarding file and symbol name delimiter: don't worry about this now. Just use a . for everything for now. We can clean it up once we manage to record relations properly.

I hope this helps ;)

@LouisStAmour

Thanks! Sorry for the radio silence on my end also :) Things got suddenly busy at work and I had folks requesting C# and XSLT analysis so I had to drop the TS analysis temporarily. I will get back to it, but possibly not until the holiday break. Instead I'm learning all this Roslyn and XSLT 1.0, fun! ;-) Might contribute to #17 when I get the chance.

@phloose

phloose commented Dec 12, 2019

I am interested in helping with the TypeScriptIndexer. I am not very experienced with TypeScript and C++, but I see this as a step forward to learn both in more detail!

@LouisStAmour I tried running your indexer but was only able to index via the command line. When I try this inside the project settings (Custom Command) it does not work. When starting to index, it either returns error code 1 when run via:

  • cmd.exe /c ts-node /path/to/typescriptIndexer/bin/index file -f %{SOURCE_FILE_PATH} -d %{DATABASE_FILE_PATH}

or when run without that:

  • command %someCustomCommand% returned with message "File not found or resource error occured".

When I run the command from the error message on the command line I get no error and the database is built 🤔. Any suggestions as to what I am doing wrong? Or is the indexer not yet at the stage where it can be used inside the project settings?

Maybe @mlangkabel also has suggestions about how to insert a command in the correct way?

@JFQueralt

Has there been any movement on this front?

@pidgeon777

I'm also interested.
