{smcl}
{* *! version 2.49.1 08aug2023}{...}
{vieweralsosee "fegen" "help fegen"}{...}
{vieweralsosee "fcollapse" "help fcollapse"}{...}
{vieweralsosee "join" "help join"}{...}
{vieweralsosee "fmerge" "help fmerge"}{...}
{vieweralsosee "flevelsof" "help flevelsof"}{...}
{vieweralsosee "fisid" "help fisid"}{...}
{vieweralsosee "fsort" "help fsort"}{...}
{vieweralsosee "" "--"}{...}
{vieweralsosee "[D] egen" "help egen"}{...}
{vieweralsosee "[D] collapse" "help collapse"}{...}
{vieweralsosee "[D] contract" "help contract"}{...}
{vieweralsosee "[D] merge" "help merge"}{...}
{vieweralsosee "[P] levelsof" "help levelsof"}{...}
{vieweralsosee "[D] sort" "help sort"}{...}
{vieweralsosee "" "--"}{...}
{vieweralsosee "moremata" "help moremata"}{...}
{vieweralsosee "reghdfe" "help reghdfe"}{...}
{viewerjumpto "Syntax" "ftools##syntax"}{...}
{viewerjumpto "Creation" "ftools##creation"}{...}
{viewerjumpto "Properties and methods" "ftools##properties"}{...}
{viewerjumpto "Description" "ftools##description"}{...}
{viewerjumpto "Usage" "ftools##usage"}{...}
{viewerjumpto "Example" "ftools##example"}{...}
{viewerjumpto "Remarks" "ftools##remarks"}{...}
{viewerjumpto "Using functions from collapse" "ftools##collapse"}{...}
{viewerjumpto "Experimental/advanced" "ftools##experimental"}{...}
{viewerjumpto "Source code" "ftools##source"}{...}
{viewerjumpto "Author" "ftools##author"}{...}

{title:Title}

{p2colset 5 15 20 2}{...}
{p2col :{cmd:FTOOLS} {hline 2}}Mata commands for factor variables{p_end}
{p2colreset}{...}

{marker syntax}{...}
{title:Syntax}

{p 8 16 2}
{it:class Factor scalar}
{bind: }{cmd:factor(}{space 3}{it:varnames} [{space 1}
{cmd:,}
{it:touse}{cmd:,} 
{it:verbose}{cmd:,} 
{it:method}{cmd:,} 
{it:sort_levels}{cmd:,} 
{it:count_levels}{cmd:,} 
{it:hash_ratio}{cmd:,}
{it:save_keys}]{cmd:)}

{p 8 16 2}
{it:class Factor scalar}
{bind: }{cmd:_factor(}{it:data} [{cmd:,}
{it:integers_only}{cmd:,} 
{it:verbose}{cmd:,} 
{it:method}{cmd:,} 
{it:sort_levels}{cmd:,} 
{it:count_levels}{cmd:,} 
{it:hash_ratio}{cmd:,}
{it:save_keys}]{cmd:)}

{p 8 16 2}
{it:class Factor scalar}
{bind: }{cmd:join_factors(}{it:F1}{cmd:,}
{it:F2} [{cmd:,}
{it:count_levels}{cmd:,} 
{it:save_keys}{cmd:,}
{it:levels_as_keys}]{cmd:)}


{marker arguments}{...}
{synoptset 38 tabbed}{...}
{synopthdr}
{synoptline}
{p2coldent:* {it:string} varnames}names of variables that identify the factors{p_end}
{synopt:{it:string} touse}name of dummy {help mark:touse} variable{p_end}
{p2coldent:}{bf:note:} you can also pass a vector with the obs. index (i.e. the first argument of {cmd:st_data()}){p_end}
{synopt:{it:string} data}transmorphic matrix with the group identifiers{p_end}

{synopt:{bf:Advanced options:}}{p_end}
{synopt:{it:real} verbose}1 to display debug information{p_end}
{synopt:{it:string} method}hashing method: mata, hash0, hash1, hash2; default is {it:mata} (auto-choose){p_end}
{synopt:{it:real} sort_levels}set to 0 under {it:hash1} to increase speed, but the new levels will not match the order of the varlist{p_end}
{synopt:{it:real} count_levels}set to 0 under {it:hash0} to increase speed, but the {it:F.counts} vector will not be generated,
so F{cmd:.panelsetup()}, F{cmd:.drop_obs()}, and related methods will not be available{p_end}
{synopt:{it:real} hash_ratio}size of the hash vector compared to the maximum number of keys (often num. obs.){p_end}
{synopt:{it:real} save_keys}set to 0 to increase speed and save memory,
but the matrix {it:F.keys} with the original values of the factors
won't be created{p_end}
{synopt:{it:real} integers_only}1 if {it:data} is numeric and takes only {it:integer} values; unless you are sure of this, set it to 0{p_end}
{synopt:{it:real} levels_as_keys}if set to 1,
{cmd:join_factors()} will use the levels of F1 and F2
as the keys (as the data) when creating F12{p_end}
{p2colreset}{...}


{marker creation}{...}
{title:Creating factor objects}

{pstd}(optional) First, you can declare the Factor object:

{p 8 8 2}
{cmd:class Factor scalar}{it: F}{break}

{pstd}Then, you can create a factor from one or more categorical variables:

{p 8 8 2}
{it:F }{cmd:=}{bind: }{cmd:factor(}{it:varnames}{cmd:)}

{pstd}
If the categories are already in Mata
({cmd:data = st_data(., varnames)}), you can do:

{p 8 8 2}
{it:F }{cmd:=}{bind: }{cmd:_factor(}{it:data}{cmd:)}

{pstd}
You can also combine two factors ({it:F1} and {it:F2}):

{p 8 8 2}
{it:F }{cmd:=}{bind: }{cmd:join_factors(}{it:F1}{cmd:,} {it:F2}{cmd:)}

{pstd}
Note that the above is exactly equivalent to (but faster than):

{p 8 8 2}
{it:varnames} {cmd:= invtokens((}{it:F1.varlist}{cmd:,} {it:F2.varlist}{cmd:))}{break}
{it:F} {cmd:=} {cmd:factor(}{it:varnames}{cmd:)}

{pstd}
If {it:levels_as_keys==1}, it is equivalent to:

{p 8 8 2}
{it:F }{cmd:=}{bind: }{cmd:_factor((}{it:F1.levels}{cmd:,} {it:F2.levels}{cmd:))}
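
{pstd}
As a minimal sketch of the calls above (assuming {cmd:ftools} is installed and its Mata library is compiled),
the following builds two factors from the auto dataset, joins them,
and compares the number of levels against a factor built directly from both variables.
The object names {it:F1}, {it:F2}, {it:F12} and {it:F} are only illustrative:
{p_end}

{inp}
    {hline 60}
    sysuse auto, clear
    mata: F1 = factor("foreign")
    mata: F2 = factor("turn")
    
    * Combine both factors without going back to the dataset
    mata: F12 = join_factors(F1, F2)
    
    * Build the same factor directly from both variables
    mata: F = factor("foreign turn")
    
    * Both approaches should report the same number of levels
    mata: F12.num_levels, F.num_levels
    {hline 60}
{txt}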


{marker properties}{...}
{title:Properties and Methods}

{synoptset 38 tabbed}{...}

{synopthdr:properties}
{synoptline}
{synopt:{it:real} F{cmd:.num_levels}}number of levels (distinct values) of the factor{p_end}
{synopt:{it:real} F{cmd:.num_obs}}number of observations of the sample used to create the factor ({cmd:c(N)} if touse was empty){p_end}
{synopt:{it:real colvector} F{cmd:.levels}}levels of the factor; dimension {cmd:F.num_obs x 1}; range: {cmd:{1, ..., F.num_levels}}{p_end}
{synopt:{it:transmorphic matrix} F{cmd:.keys}}values of the input varlist that correspond to the factor levels;
dimension {cmd:F.num_levels x K}, where {it:K} is the number of input variables; not created if save_keys==0; unordered if sort_levels==0{p_end}
{synopt:{it:real vector} F{cmd:.counts}}frequencies of each level (in the sample set by touse);
dimension {cmd:F.num_levels x 1}; will be empty if count_levels==0{p_end}

{synopt:{it:string rowvector} F{cmd:.varlist}}name of variables used to create the factor{p_end}
{synopt:{it:string rowvector} F{cmd:.varformats}}formats of the input variables{p_end}
{synopt:{it:string rowvector} F{cmd:.varlabels}}labels of the input variables{p_end}
{synopt:{it:string rowvector} F{cmd:.varvaluelabels}}value labels attached to the input variables{p_end}
{synopt:{it:string rowvector} F{cmd:.vartypes}}types of the input variables{p_end}
{synopt:{it:string rowvector} F{cmd:.vl}}value label definitions used by the input variables{p_end}
{synopt:{it:string} F{cmd:.touse}}name of touse variable{p_end}
{synopt:{it:real} F{cmd:.is_sorted}}1 if the dataset is sorted by F{cmd:.varlist}{p_end}


{synopthdr:main methods}
{synoptline}
{synopt:{it:void} F{cmd:.store_levels(}{newvar}{cmd:)}}save
the levels back into the dataset (using the same {it:touse}){p_end}
{synopt:{it:void} F{cmd:.store_keys(}[{it:sort}]{cmd:)}}save
the original key variables into a reduced dataset, including formatting and labels. If {it:sort} is 1, Stata will report the dataset as sorted{p_end}
{synopt:{it:void} F{cmd:.panelsetup()}}compute auxiliary vectors {it:F.info}
and {it:F.p} (see below); used in panel computations{p_end}


{synopthdr:ancillary methods}
{synoptline}
{synopt:{it:real scalar} F{cmd:.equals(}F2{cmd:)}}1
if {it:F} represents the same data as {it:F2}
(i.e. if .num_obs, .num_levels, .levels, .keys, and .counts are equal)
{p_end}
{synopt:{it:real scalar} F{opt .nested_within(vec)}}1
if the factor {it:F} is
{browse "http://scorreia.com/software/reghdfe/faq.html#what-does-fixed-effect-nested-within-cluster-means":nested within}
the column vector {it:vec}
(i.e. if any two obs. with the same factor level also have the same value of {it:vec}).
For instance, it is true if the factor {it:F} represents counties and {it:vec} represents states.
{p_end}
{synopt:{it:void} F{cmd:.drop_obs(}{it:idx}{cmd:)}}update
{it:F} to reflect a change in the underlying dataset, where
the observations listed in the column vector {it:idx} are dropped
(see example below)
{p_end}
{synopt:{it:void} F{cmd:.keep_obs(}{it:idx}{cmd:)}}equivalent
to keeping only the obs. enumerated by {it:idx} and recreating {it:F};
uses {cmd:.drop_obs()}
{p_end}
{synopt:{it:void} F{cmd:.drop_if(}{it:vec}{cmd:)}}equivalent
to dropping the obs. where {it:vec!=0} and recreating {it:F};
uses {cmd:.drop_obs()}
{p_end}
{synopt:{it:void} F{cmd:.keep_if(}{it:vec}{cmd:)}}equivalent
to keeping the obs. where {it:vec!=0} and recreating {it:F};
uses {cmd:.drop_obs()}
{p_end}
{synopt:{it:real colvector} F{cmd:.drop_singletons()}}equivalent
to dropping the levels that only appear once,
and their corresponding observations.
The colvector returned contains the observations that need to be excluded
(note: see the source code for some advanced optional arguments).
{p_end}
{synopt:{it:real scalar} F{opt .is_id()}}1
if {it:F.counts} is always 1
(i.e. if {it:F.levels} has no duplicates)
{p_end}
{synopt:{it:real vector} F{cmd:.intersect(}{it:vec}{cmd:)}}return
a mask vector equal to 1 if the row of {it:vec} is also in {it:F.keys}.
Also accepts the integers_only and verbose options: {it:mask = F.intersect(y, 1, 1)}
{p_end}


{synopthdr:available after F.panelsetup()}
{synoptline}
{synopt:{it:transmorphic matrix} F{cmd:.sort(}{it:data}{cmd:)}}equivalent to
{cmd:data[F.p, .]}
but calls {cmd:F.panelsetup()} if required; {it:data} is a {it:transmorphic matrix}{p_end}
{synopt:{it:transmorphic matrix} F{cmd:.invsort(}{it:data}{cmd:)}}equivalent to
{cmd:data[invorder(F.p), .]}, so it undoes a previous sort operation. Note that {cmd:F.invsort(F.sort(x))==x}. After its first use, it also fills the vector {cmd:F.inv_p = invorder(F.p)}, so the operation can be repeated cheaply.
{p_end}
{synopt:{it:void} F{cmd:._sort(}{it:data}{cmd:)}}in-place version of
{cmd:.sort()};
slower but uses less memory, as it's based on {cmd:_collate()}{p_end}
{synopt:{it:real matrix} F{cmd:.info}}equivalent to {help mf_panelsetup:panelsetup()}
(returns a {it:(num_levels x 2)} matrix with start and end positions of each level/panel).{p_end}
{p2coldent:}{bf:note:} instead of using {cmd:F.info} directly, use panelsubmatrix():
{cmd:x = panelsubmatrix(X, i, F.info)} and {cmd:panelsum()} (see example at the end){p_end}
{synopt:{it:real vector} F{cmd:.p}}equivalent to {cmd:order(F.levels)}
but implemented with a counting sort that is asymptotically
faster ({it:O(N)} instead of {it:O(N log N)}).{p_end}
{p2coldent:}{bf:note:} do not use {cmd:F.p} directly, as it will be missing if the data is already sorted by the varnames.{p_end}
{p2colreset}{...}
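
{pstd}
As a minimal sketch of how {cmd:F.sort()}, {cmd:F.info} and Mata's built-in {cmd:panelsum()} fit together
(assuming {cmd:ftools} is installed), the following computes the total price for each level of {it:foreign}.
It uses {cmd:F.sort()} rather than {cmd:F.p}, as the note above advises:
{p_end}

{inp}
    {hline 60}
    sysuse auto, clear
    mata: F = factor("foreign")
    mata: F.panelsetup()
    
    * Reorder price so that obs. of the same level are contiguous
    mata: y = F.sort(st_data(., "price"))
    
    * One row per level: key, total price, and number of obs.
    mata: F.keys, panelsum(y, F.info), F.counts
    {hline 60}
{txt}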


{pstd}Notes:

{synoptset 3 tabbed}{...}
{synopt:- }If you just downloaded the package and want to use the Mata functions directly (instead of the Stata commands), run {stata ftools} once, which creates the Mata library if needed.{p_end}
{synopt:- }To force compilation of the Mata library, type {stata ftools, compile}{p_end}
{synopt:- }{cmd:F.extra} is an undocumented {help mf_asarray:asarray}
that can be used to store additional information: {cmd:asarray(F.extra, "lorem", "ipsum")};
and retrieve it: {cmd:ipsum = asarray(F.extra, "lorem")}{p_end}
{synopt:- }{cmd:join_factors()} is particularly fast if the dataset is sorted in the same order as the factors{p_end}
{synopt:- }{cmd:factor()} will call {cmd:join_factors()} if appropriate
(2+ integer variables; 10,000+ obs; and method=hash1)
{p_end}


{marker description}{...}
{title:Description}

{pstd}
The {it:Factor} object is a key component of several commands that
manipulate data without having to sort it beforehand:

{pmore}- {help fcollapse} (alternative to collapse, contract, collapse+merge and some egen functions){p_end}
{pmore}- {help fegen:fegen group}{p_end}
{pmore}- {help fisid}{p_end}
{pmore}- {help join} and {help fmerge} (alternative to m:1 and 1:1 merges){p_end}
{pmore}- {help flevelsof} (plug-in alternative to {help levelsof}){p_end}
{pmore}- {help fsort} (note: this is O(N) but with a high constant term){p_end}
{pmore}- freshape{p_end}

{pstd}
Ancillary commands include:

{pmore}- {help local_inlist}: returns a local {it:inlist} based on a variable and a list of values or labels{p_end}

{pstd}
The {it:Factor} object rearranges one or more categorical variables into a new variable that takes values from 1 to F.num_levels. You can then efficiently sort any other variable by this new variable, in order to compute group statistics and perform other manipulations.

{pstd}
For technical information, see
{browse "http://stackoverflow.com/questions/8991709/why-are-pandas-merges-in-python-faster-than-data-table-merges-in-r/8992714#8992714":[1]},
{browse "http://wesmckinney.com/blog/nycpython-1102012-a-look-inside-pandas-design-and-development/":[2]},
and to a lesser degree
{browse "https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/AnalyzingData/Optimizations/AvoidingGROUPBYHASHWithProjectionDesign.htm":[3]}.


{marker usage}{...}
{title:Usage}

{pstd}
If you only want to create identifiers based on one or more variables,
run something like:

{inp}
    {hline 60}
    sysuse auto, clear
    mata: F = factor("foreign turn")
    mata: F.store_levels("id")
    mata: mata drop F
    {hline 60}
{txt}

{pstd}
More complex scenarios would involve some of the following:

{inp}
    {hline 60}
    sysuse auto, clear

    * Create factors for foreign data only
    mata: F = factor("turn", "foreign")

    * Report number of levels, obs. in sample, and keys
    mata: F.num_levels
    mata: F.num_obs
    mata: F.keys, F.counts

    * View new levels
    mata: F.levels[1::10]
    
    * Store back new levels (on the same sample)
    mata: F.store_levels("id")
    
    * Verify that the results are correct
    sort id
    li turn foreign id in 1/10
    {hline 60}
{txt}


{marker example}{...}
{title:Example: operating on levels of each factor}

{pstd}
This example shows how to process data for each level of the factor (like {help bysort}). It does so by combining {cmd:F.sort()} with {help mf_panelsetup:panelsubmatrix()}.
{p_end}

{pstd}
In particular, this code runs a regression for each category of {it:foreign}:
{p_end}

{inp}
    {hline 60}
    clear all
    mata:
    real matrix reg_by_group(string depvar, string indepvars, string byvar)
    {
    	class Factor scalar			F
    	real scalar				i
    	real matrix				X, Y, x, y, betas
    
    	F = factor(byvar)
    	Y = F.sort(st_data(., depvar))
    	X = F.sort(st_data(., tokens(indepvars)))
    	betas = J(F.num_levels, 1 + cols(X), .)
    	
    	for (i = 1; i <= F.num_levels; i++) {
    		y = panelsubmatrix(Y, i, F.info)
    		x = panelsubmatrix(X, i, F.info) , J(rows(y), 1, 1)
    		betas[i, .] = qrsolve(x, y)'
    	}
    	return(betas)
    }
    end
    sysuse auto
    mata: reg_by_group("price", "weight length", "foreign")
    {hline 60}
{text}


{marker example2}{...}
{title:Example: Factors nested within another variable}

{pstd}
You might be interested in knowing whether a categorical variable is nested within another, coarser variable.
For instance, a variable containing months ("Jan2017") is nested within another containing years ("2017"),
a variable containing counties ("Durham County, NC") is nested within another containing states ("North Carolina"), and so on.
{p_end}

{pstd}
To check for this, you can follow this example:
{p_end}

{inp}
    {hline 60}
    sysuse auto
    gen turn10 = int(turn/10)
    
    mata:
        F = factor("turn")
        F.nested_within(st_data(., "trunk")) // False
        F.nested_within(st_data(., "turn")) // Trivially true
        F.nested_within(st_data(., "turn10")) // True
    end
    {hline 60}
{txt}

{pstd}
You can also compare two factors directly:
{p_end}

{inp}
    {hline 60}
    mata:
        F1 = factor("turn")
        F2 = factor("turn10")
        F1.nested_within(F2.levels) // True
    end
    {hline 60}
{txt}


{marker example3}{...}
{title:Example: Updating a factor after dropping observations}

{pstd}
If you change the underlying dataset you have to recreate the factor, which is costly. As an alternative, you can use {cmd:.keep_obs()} and related methods:
{p_end}

{inp}
    {hline 60}
    * Benchmark
    sysuse auto, clear
    drop if price > 4500
    mata: F1 = factor("turn")
    // Quickly inspect results
    mata: F1.num_obs, F1.num_levels, hash1(F1.levels)
    
    * Using F.drop_obs()
    sysuse auto, clear
    mata
        price = st_data(., "price")
        F2 = factor("turn")
        idx = selectindex(price :> 4500)
        // Inspect the factor before and after dropping (already inside Mata, so no "mata:" prefix)
        F2.num_obs, F2.num_levels, hash1(F2.levels)
        F2.drop_obs(idx)
        F2.num_obs, F2.num_levels, hash1(F2.levels)
        assert(F1.equals(F2))
    end
    
    * Using the other methods
    mata
        F2 = factor("turn")
        idx = selectindex(price :<= 4500)
        F2.keep_obs(idx)
        assert(F1.equals(F2))
    
        F2 = factor("turn")
        F2.drop_if(price :> 4500)
        assert(F1.equals(F2))
    
        F2 = factor("turn")
        F2.keep_if(price :<= 4500)
        assert(F1.equals(F2))
    end
    {hline 60}
{txt}


{marker remarks}{...}
{title:Remarks}

{pstd}
All-numeric and all-string varlists are allowed, but
hybrid varlists (where some but not all variables are strings) are not possible
due to Mata limitations.
As a workaround, first convert the string variables to numeric (e.g. using {cmd:fegen group()}) and then run your intended command.

{pstd}
You can pass {it:varnames} either as a single string such as "turn trunk"
or as a tokenized string (a string rowvector) such as ("turn", "trunk").

{pstd}
To generate a group identifier, most commands first sort the data by a list of keys (such as {it:gvkey, year}) and then ask if the keys differ from one observation to the other.
Instead, {cmd:ftools} exploits the insights that sorting the data is not required to create an identifier,
and that once an identifier is created, we can then use a {it:counting sort} to sort the data in {it:O(N)} time instead of {it:O(N log N)}.
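
{pstd}
The counting sort idea can be sketched in plain Mata as follows.
This is an illustration only and not the code used by {cmd:ftools};
the helper {cmd:counting_order()} and the toy vector {it:levels} are hypothetical.
Given identifiers in 1..{it:num_levels}, it makes two passes over the data and returns a permutation vector that sorts them:
{p_end}

{inp}
    {hline 60}
    mata:
    // counting_order() is a hypothetical helper, not an ftools function
    real colvector counting_order(real colvector levels, real scalar num_levels)
    {
        real colvector      counts, next, p
        real scalar         i, g
    
        // Pass 1: count how many obs. fall in each level
        counts = J(num_levels, 1, 0)
        for (i = 1; i <= rows(levels); i++) {
            counts[levels[i]] = counts[levels[i]] + 1
        }
    
        // next[g] is the first output position reserved for level g
        // (the vector has one extra trailing element that is never used)
        next = runningsum(1 \ counts)
    
        // Pass 2: place each obs. in the next free position of its level
        p = J(rows(levels), 1, .)
        for (i = 1; i <= rows(levels); i++) {
            g = levels[i]
            p[next[g]] = i
            next[g] = next[g] + 1
        }
        return(p)
    }
    
    levels = (2 \ 1 \ 3 \ 1 \ 2)
    p = counting_order(levels, 3)
    // Applying the permutation yields the sorted identifiers
    assert(levels[p] == sort(levels, 1))
    end
    {hline 60}
{txt}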

{pstd}
To create an identifier (that takes a value in {c -(}1, ..., {it:#keys}{c )-}) we first map each key (composed of one or more numbers and strings) to a unique integer.
For instance, the key {it:gvkey=123, year=2010} is assigned the integer {it:4268248869} with the Mata function {cmd:hash1}.
This identifier can then be used as an index when accessing vectors, bypassing the need for sorts.
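
{pstd}
To illustrate how an identifier can serve as an index, the sketch below accumulates group totals in a single pass, without sorting.
The vectors {it:levels}, {it:y} and {it:sums} are made-up toy data (with {it:levels} playing the role of {it:F.levels}), and the loop is not taken from the {cmd:ftools} source:
{p_end}

{inp}
    {hline 60}
    mata:
    levels = (2 \ 1 \ 3 \ 1 \ 2)
    y      = (10 \ 20 \ 30 \ 40 \ 50)
    sums   = J(3, 1, 0)
    
    // levels[i] is the row of sums that obs. i contributes to
    for (i = 1; i <= rows(y); i++) sums[levels[i]] = sums[levels[i]] + y[i]
    
    assert(sums == (60 \ 60 \ 30))
    end
    {hline 60}
{txt}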

{pstd}
The program tries to pick the hash function that best matches the dataset and input variables.
For instance, if the input variables have a small range of possible values (e.g. if they are of {it:byte} type), we select the {it:hash0} method, which uses a (non-minimal) perfect hashing but might consume a lot of memory.
Alternatively, {it:hash1} is used, which combines Mata's
{help mf_hash1:hash1} function with {browse "https://www.wikiwand.com/en/Open_addressing":open addressing} to resolve collisions (this is more efficient than Mata's {help mf_asarray:asarray}).


{marker collapse}{...}
{title:Using the functions from {it:fcollapse}}

{pstd}
You can access the {cmd:aggregate_*()} functions so you can collapse information without resorting to Stata. Example:

{inp}
    {hline 60}
    sysuse auto, clear
    mata: F = factor("turn")
    mata: F.panelsetup()
    mata: y = st_data(., "price")
    mata: sum_y = aggregate_sum(F, F.sort(y), ., "")
    mata: F.keys, F.counts, sum_y
    
    * Benchmark
    collapse (sum) price, by(turn)
    list
    {hline 60}
{txt}

{pstd}
The function names start with {cmd:aggregate_}, and the functions are listed {view fcollapse_functions.mata, adopath asis:here}.


{marker experimental}{...}
{title:Experimental/advanced functions}

{p 8 16 2}
{it:real scalar}
{bind: }{cmd:init_zigzag(}{it:F1}{cmd:,}
{it:F2}{cmd:,}
{it:F12}{cmd:,}
{it:F12_1}{cmd:,}
{it:F12_2}{cmd:,}
{it:queue}{cmd:,} 
{it:stack}{cmd:,}
{it:subgraph_id}{cmd:,}
{it:verbose}{cmd:)}

{pstd}Notes:

{synoptset 3 tabbed}{...}
{synopt:- }Given the bipartite graph formed by F1 and F2,
the function returns the number of disjoint subgraphs (mobility groups){p_end}
{synopt:- }F12 must be set with levels_as_keys==1{p_end}
{synopt:- }For F12_1 and F12_2, you can set save_keys==0{p_end}
{synopt:- }The function fills three useful vectors: queue, stack and subgraph_id{p_end}
{synopt:- }If subgraph_id==0, the id vector will not be created{p_end}


{marker source}{...}
{title:Source code}

{pstd}
{view ftools.mata, adopath asis:ftools.mata};
{view ftools_type_aliases.mata, adopath asis:ftools_type_aliases.mata};
{view ftools_main.mata, adopath asis:ftools_main.mata};
{view ftools_bipartite.mata, adopath asis:ftools_bipartite.mata};
{view fcollapse_functions.mata, adopath asis:fcollapse_functions.mata}
{p_end}

{pstd}
Also, the latest version is available online: {browse "https://github.com/sergiocorreia/ftools/source"}


{marker author}{...}
{title:Author}

{pstd}Sergio Correia{break}
{break}
{browse "http://scorreia.com"}{break}
{browse "mailto:sergio.correia@gmail.com":sergio.correia@gmail.com}{break}
{p_end}


{marker project}{...}
{title:More Information}

{pstd}{break}
To report bugs, contribute, ask for help, etc., please see the project page on GitHub:{break}
{browse "https://github.com/sergiocorreia/ftools"}{break}
{p_end}


{marker acknowledgment}{...}
{title:Acknowledgment}

{pstd}
This project was largely inspired by the works of
{browse "http://wesmckinney.com/blog/nycpython-1102012-a-look-inside-pandas-design-and-development/":Wes McKinney}, 
{browse "http://www.stata.com/meeting/uk15/abstracts/":Andrew Maurer}
and
{browse "https://ideas.repec.org/c/boc/bocode/s455001.html":Ben Jann}.
{p_end}